IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 19, NO. 11, NOVEMBER 2008

Energy-Efficient Scheduling of Real-Time Tasks on Multicore Processors

Euiseong Seo, Jinkyu Jeong, Seonyeong Park, and Joonwon Lee

Abstract—Multicore processors deliver higher throughput at lower power consumption than unicore processors, so in the near future they will be widely used in mobile real-time systems. There has been much research on energy-efficient scheduling of real-time tasks using DVS. These approaches must be modified for multicore processors, however, since normally all the cores in a chip must run at the same performance level. Thus, blindly adopting existing DVS algorithms that do not consider this restriction will result in wasted energy. This article suggests a Dynamic Repartitioning algorithm based on existing partitioning approaches for multiprocessor systems. The algorithm dynamically balances the task loads of multiple cores to optimize power consumption during execution. We also suggest a Dynamic Core Scaling algorithm, which adjusts the number of active cores to reduce leakage power consumption under low-load conditions. Simulation results show that Dynamic Repartitioning can produce energy savings of about 8 percent even with the best energy-efficient partitioning algorithm. The results also show that Dynamic Core Scaling can reduce energy consumption by about 26 percent under low-load conditions.

Index Terms—Real-time systems, real-time scheduling, low-power design, power-aware systems, multicore processors, multiprocessor systems.

1 INTRODUCTION

MOBILE real-time systems have seen rapidly increasing use in sensor networks, satellites, and unmanned vehicles, as well as in personal mobile equipment. Their energy efficiency is thus becoming an important issue. The processor is one of the most important power consumers in any computing system. Considering that state-of-the-art real-time systems are evolving in complexity and scale, the demand for high-performance processors will continue to increase. A processor's performance, however, is directly related to its power consumption. As a result, processor power consumption is becoming a more important issue as required performance standards increase.

Over the last decade, manufacturers competed to advance the performance of processors by raising the clock frequency. However, the dynamic power consumption Pdynamic of a CMOS-based processor, the power required during the execution of instructions, is related to its clock frequency f and operating voltage Vdd as Pdynamic ∝ V²dd · f. The relation Vdd ∝ f also holds in these processors. As a result, the dramatically increased power consumption caused by high clock frequencies has stopped the race, and manufacturers are now concentrating on other ways to improve performance at relatively low clock frequencies.
`
. E. Seo is with the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16803. E-mail: euiseong@gmail.com.
. J. Jeong, S. Park, and J. Lee are with the Computer Science Division, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, Korea. E-mail: {jinkyu, parksy}@calab.kaist.ac.kr, joon@kaist.ac.kr.

Manuscript received 29 Oct. 2007; accepted 13 June 2008; published online 17 June 2008.
Recommended for acceptance by I. Ahmad, K.W. Cameron, and R. Melhem.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number TPDS-2007-10-0397.
Digital Object Identifier no. 10.1109/TPDS.2008.104.
`
One of the representative results of this effort is the multicore architecture [1], which integrates several processing units (known as cores) into a single chip. Multicore processors, which are quickly becoming mainstream, can achieve higher throughput at the same clock frequency; power consumption in them is thus a linear function of the throughput. As the demand for concurrent processing and increased energy efficiency grows, it is expected that multicore processors will become widely used in real-time systems.

The problem of scheduling real-time tasks on a multicore processor is the same as that of scheduling on a multiprocessor system. This is an NP-hard problem [2], and existing heuristic solutions can be divided into two categories. Partitioned scheduling algorithms [3], [4], [5] require every execution of a particular task to take place on the same processor, while global scheduling algorithms [6], [7], [8] permit a given task to be executed on different processors [6]. Partitioned algorithms are based on a divide-and-conquer strategy. After all tasks have been assigned to their respective cores, the tasks on each core can be scheduled using well-known algorithms such as Earliest Deadline First (EDF) [9] or Rate Monotonic (RM) [10]. Due to their simplicity and efficiency, partitioned scheduling algorithms are generally preferred over global scheduling algorithms.

In addition to the innovation of the multicore architecture, many up-to-date processors also use dynamic voltage scaling (DVS). DVS adjusts the clock frequency and operating voltage on the fly to meet changes in the performance demand.

Multicore processors can also benefit greatly from DVS technology. Because all the cores in a chip are in the same clock domain, however, they must all operate at the same clock frequency and operating voltage [11], [12]. It seems
Authorized licensed use limited to: UNIST. Downloaded on May 26, 2009 at 21:48 from IEEE Xplore. Restrictions apply.

1045-9219/08/$25.00 © 2008 IEEE    Published by the IEEE Computer Society

Petitioner Samsung Ex-1031, 0001
`
`
that this limitation will remain in force for some years at least, because the design and production of multicore processors with independent clock domains is still prohibitively expensive.

There has been much research [13], [14], [15], [16], [17] on how best to use DVS with real-time tasks on a unicore processor. In systems consisting of multiple DVS processors, DVS scheduling is easily accomplished by using those existing algorithms on each processor after partitioning [13], [14], [15]. In multicore environments, however, the benefit of this approach is greatly reduced by the limitation that all cores must share the same clock. Even though the performance demands of each core may differ at a given scheduling point, this limitation forces all cores to work at the highest scheduled frequency. Compared to a multiprocessor system, a multicore system will thus consume more power needlessly if the existing DVS method is adopted blindly.

This paper suggests a dynamic, power-conscious, real-time scheduling algorithm to resolve this problem. In general, multicore processors have some caches that are shared among their cores. Task migration between cores thus requires less overhead than migration between fully independent processors. Exploiting this property, the suggested scheme, Dynamic Repartitioning, tries to keep the performance demands of the cores balanced by migrating tasks from the core with the highest demand to the one with the lowest demand. As in multiprocessor systems, the dynamic performance demand of each core is given by existing DVS algorithms, and migration decisions are made at every scheduling time. This repartitioning of tasks is expected to reduce the dynamic power consumption by lowering the maximum performance demand of the cores at any given moment.

In addition to dynamic power, there is another source of power consumption that must be considered. Unlike dynamic power, which is consumed during instruction execution, leakage power is consumed as long as there is electric current in the circuits. In general, this energy loss is proportional to the circuit density and the total number of circuits in the processor. Leakage power has thus been taking up an increasing proportion of the total power, up to 44 percent in 50 nm technology for an active cycle of a uniprocessor [18], and it will become even larger in a multicore processor because of the vastly increased number of circuits.

In this paper, we also suggest a method of reducing the leakage power by adjusting the number of active cores. Dynamic Core Scaling decides on the optimal number of cores for the current performance demand and tries to meet this criterion as long as all deadlines are guaranteed. Dynamic Core Scaling is expected to save a considerable amount of leakage power in low-load periods, in which leakage power makes up a large fraction of the total power consumption.

The suggested Dynamic Repartitioning and Dynamic Core Scaling methods were evaluated through simulations by applying them to a well-known processor power model. The target task sets in the simulations were designed to demonstrate the behavior of the algorithms under diverse environments.
`
TABLE 1
Example Task Set [13]

Task   Period   WCET   Utilization   cc1   cc2
τ1     8 ms     3 ms   0.375         2     1
τ2     10 ms    3 ms   0.300         1     1
τ3     14 ms    1 ms   0.071         1     1
`
The rest of this paper is organized as follows: Section 2 reviews existing research on the use of DVS in real-time unicore processor systems and on the development of energy-efficient scheduling algorithms for multiprocessor and multicore systems. Section 3 defines the problem and describes the power consumption model used in this paper. In Section 4, we describe the Dynamic Repartitioning algorithm as a way of efficiently reducing clock frequencies. In Section 5, we introduce the Dynamic Core Scaling algorithm, which reduces the leakage power by adjusting the number of activated cores. Section 6 presents simulation results for the two algorithms, and Section 7 summarizes our conclusions.
`
2 RELATED WORK
2.1 DVS on a Unicore Processor
In this paper, the WCET of a task is taken as the time required to finish the worst-case execution path at maximum performance. The actual WCET of a task is this value scaled to the current performance, and it increases linearly as performance degrades. We use the term utilization of a task to refer to its WCET divided by its period; it is the fraction of processor time dedicated to the task at maximum performance. As a matter of course, the relative utilization of a task, which is based on its actual WCET, also grows as the performance degrades.

EDF is the optimal algorithm for preemptible periodic real-time task scheduling. Defining the utilization of a task as its WCET divided by its period, EDF can guarantee meeting the deadlines of every task set whose total utilization is less than one. Based on this property, Pillai and Shin [13] suggested three DVS scheduling heuristics: Static, Cycle-conserving, and Look-ahead.

The Static algorithm adjusts the clock frequency so that the total relative utilization of all tasks is 1.0. For example, the total utilization of the task set in Table 1 is 0.746. If the execution time of all tasks is inversely proportional to the clock frequency, then we can achieve the highest possible energy efficiency while still meeting all deadlines by scaling the performance down to 0.746 times the maximum clock frequency.
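As a quick check of the arithmetic above, the Static scale for the task set of Table 1 can be computed in a few lines. This is an illustrative sketch, not code from the paper; the task parameters are taken from Table 1.

```python
# Static DVS heuristic of Pillai and Shin [13]: pick the lowest
# frequency scale at which the total utilization still fits in 1.0.
# Task parameters are (period_ms, wcet_ms) from Table 1.
tasks = {"t1": (8, 3), "t2": (10, 3), "t3": (14, 1)}

def static_scale(tasks):
    # Total utilization at maximum frequency; scaling the clock by
    # this factor stretches every WCET so the total relative
    # utilization becomes exactly 1.0, and EDF still meets deadlines.
    return sum(wcet / period for period, wcet in tasks.values())

print(round(static_scale(tasks), 3))  # 0.746
```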
A given task may finish earlier than its WCET, and the actual execution time changes with every period. If the tasks in Table 1 have the actual execution times given in columns cc1 and cc2 during their first and second periods, respectively, then idle periods will result even when executing at the frequency given by the Static algorithm. This means that the frequency could be lowered further.

To exploit this phenomenon, the Cycle-conserving algorithm adopts the notion of dynamic utilization, which
`
`
[Figure: frequency vs. time plots; the frequency steps down through 0.746, 0.621, 0.546, 0.496, and 0.296 as tasks complete and their utilizations are updated.]
Fig. 1. Cycle-conserving algorithm on the example task set [13]. (a) After finishing execution of τ1. (b) After finishing execution of τ2 and τ3. (c) Actual execution flow for two rounds.
`
is updated whenever tasks are scheduled and finished. On the completion of a task, it updates the utilization based on the task's actual execution time. The next time the task begins executing, however, its utilization is restored to the original value based on the task's WCET. In this manner, the Cycle-conserving algorithm may choose a lower frequency than the Static algorithm during the period between a task's completion and the start of its next period. It thus saves more energy than the Static algorithm.

Fig. 1 shows an example of the Cycle-conserving algorithm at work. The actual execution time of τ1 is 2 ms (Fig. 1a). The utilization of τ1 is thus updated from 3/8 to 2/8 after its first execution, and the total utilization of the task set decreases to 0.621. At this point (Fig. 1b), the processor will be operated at 0.621 times the highest frequency. The utilization of τ2 drops from 3/10 to 1/10 after completion, and as a result, τ3 can be executed at 0.421 times the highest frequency. The actual execution flow under the Cycle-conserving algorithm for both rounds is shown in Fig. 1c.

Cycle-conserving is expected to lead to a higher energy efficiency than the Static algorithm because it reduces the frequency during idle periods. As shown in Fig. 1c, at the start of each new period, it assumes the worst case for the current task. As a result, the frequency tends to start high and decrease gradually as tasks are completed. If the actual execution times of most tasks fall short of their WCETs, however, it is better to start with a low frequency and defer the use of high-frequency processing as long as all deadlines can be met. This is the basic concept of the Look-ahead algorithm. When actual execution times usually fall short of their corresponding WCETs, the Look-ahead algorithm gives better results than Cycle-conserving. In cases where the actual execution times are usually close to their WCETs, however, Cycle-conserving is the better choice.
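The per-task bookkeeping of the Cycle-conserving algorithm can be sketched as follows. The class and method names are ours, chosen for illustration; the update rules follow the description above.

```python
# Cycle-conserving EDF [13]: utilization bookkeeping sketch.
# On completion, a task counts with its actual execution time;
# at its next release it counts with its full WCET again.
class CycleConserving:
    def __init__(self, tasks):
        # tasks: {name: (period, wcet)} measured at maximum frequency
        self.tasks = tasks
        self.util = {n: w / p for n, (p, w) in tasks.items()}

    def on_completion(self, name, actual):
        # Reclaim the cycles the task did not use this period.
        period, _ = self.tasks[name]
        self.util[name] = actual / period

    def on_release(self, name):
        # A new period starts: assume the worst case again.
        period, wcet = self.tasks[name]
        self.util[name] = wcet / period

    def frequency_scale(self):
        # Run at the lowest scale covering the current total demand.
        return min(1.0, sum(self.util.values()))

sched = CycleConserving({"t1": (8, 3), "t2": (10, 3), "t3": (14, 1)})
sched.on_completion("t1", 2)   # t1 actually ran 2 ms, not 3 ms
print(round(sched.frequency_scale(), 3))   # 0.621, as in Fig. 1a
```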
A variety of DVS algorithms have been proposed in addition to these. Aydin et al. [14], for example, suggested the Generic Dynamic Reclaiming Algorithm (GDRA) and the Aggressive Speed Adjustment (AGR) algorithm. GDRA is in many respects similar to the Cycle-conserving algorithm; AGR, however, sets the frequency based on the execution history of the tasks. Gruian [15] suggested an algorithm that starts at a low frequency and increases the processing speed gradually based on statistics of the actual execution times. Kim et al. [17] also suggested a method of utilizing slack time based on the expectation of slack-time occurrences. These alternative approaches are helpful in cases where trends are visible in the actual execution times, for example, when the most recent execution time is related to the previous one.
`
2.2 Power-Aware Scheduling on Multiprocessors
Besides the problem of deciding which task to execute at a certain time, multiprocessor real-time systems must also decide which processor each task will run on. Partitioned scheduling is the most widely used solution to this NP-hard problem. In partitioned scheduling, every processor has its own task queue, and in an initial stage, each task is partitioned into one of these queues. Each processor's task set is then scheduled with a single-processor scheduling algorithm such as EDF or RM [3], [5]. The partitioning itself is a variant of the well-known Knapsack problem, for which a number of heuristics such as Best Fit, Worst Fit, and Next Fit are known to work well.

The partitioning approach has the advantage of utilizing DVS. Any of the many DVS algorithms described in Section 2.1 can be used to adjust the frequency of each processor and its associated task set. To maximize the energy efficiency, however, the utilizations of the partitioned sets should be well balanced [4]; this is because the dynamic power consumption increases as the cube of the utilization.

Aydin and Yang [4] proved that it is also an NP-hard problem to partition a list of tasks into a given number of sets that are optimally load balanced, with the guarantee that each task set can be scheduled on the system. They also showed that, among well-known heuristics, worst fit decreasing (WFD) generates the most balanced sets. WFD applies the worst-fit algorithm to the tasks after sorting them in order of decreasing task utilization.
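As a concrete sketch of WFD, the heuristic sorts tasks by decreasing utilization and then repeatedly places each task on the currently least-loaded core. The code below is our own illustration, not the paper's implementation.

```python
def worst_fit_decreasing(utils, m):
    """Partition task utilizations onto m cores with the WFD heuristic.

    utils: {task_name: utilization}. Returns (partitions, loads).
    Worst fit always picks the currently least-loaded core, which
    tends to produce the most balanced partition [4].
    """
    loads = [0.0] * m
    parts = [[] for _ in range(m)]
    # Sorting by decreasing utilization is what turns worst fit into WFD.
    for name, u in sorted(utils.items(), key=lambda kv: -kv[1]):
        core = min(range(m), key=lambda c: loads[c])
        if loads[core] + u > 1.0:
            raise ValueError("task set not schedulable by WFD on m cores")
        loads[core] += u
        parts[core].append(name)
    return parts, loads

# Example: four tasks on two cores end up with loads 0.446 and 0.550.
parts, loads = worst_fit_decreasing(
    {"t1": 0.375, "t2": 0.300, "t3": 0.071, "t4": 0.250}, 2)
```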
There are also many scheduling heuristics for various configurations of the target environment. Gruian [19] proposed a simulated annealing approach to multiprocessor energy-efficient scheduling that considers precedence constraints and a predictable execution time for each task. Chen et al. [20] suggested an approximation algorithm with different approximation bounds for processors with and without constraints on processor speed, for task sets with common periods. In the work of Anderson and Baruah [21], the
`
`
[Figure: number of cores, active cores, and clock frequency plotted against time; as cores go to sleep, the clock frequency of the remaining active cores rises from 25% toward 100%.]
Fig. 2. Example schedule generated by the heuristic algorithm [12] for DVS-CMP.
`
trade-off between increasing the number of processors and increasing the performance of each processor is explored, and they also suggested algorithms to solve the problem through static analysis. However, even though many such works exist, most of them are based on static analysis of the WCETs of tasks and give little consideration to utilizing slack time.
`
2.3 Power-Aware Scheduling on Multicores
While much research has examined the problem of energy-efficient scheduling for single-processor or multiprocessor systems, little work has been done on multicore processors. Nikitovic and Brorsson [22] assumed an adaptive chip-multicore processor (ACMP) architecture, in which the cores have several operating modes (RUN, STANDBY, and DORMANT) and can change their state dynamically and independently according to the performance demand. They suggested some scheduling heuristics for ACMP systems and demonstrated that these heuristics save a significant amount of energy for non-real-time task sets compared to a high-frequency unicore processor. Although this work introduced the benefits of processors with multiple cores that can change their operating mode independently, it does not take into consideration the demands of real-time task sets.

The first energy-efficient approach to real-time scheduling on a multicore processor was suggested by Yang et al. [12], who assumed a DVS-enabled chip multiprocessor (DVS-CMP). In DVS-CMP systems, all cores share the same clock frequency, but a core can "sleep" independently if it has no work to do. Yang et al. proved that energy-efficient scheduling of periodic real-time tasks on a DVS-CMP system is an NP-hard problem. They thus suggested a heuristic algorithm for scheduling a framed, periodic, real-time task model. In this model, all tasks have the same period, share a deadline that is equal to the end of the period, and start at the same time. As shown in Fig. 2, the suggested algorithm starts executing tasks at a low performance. As time goes on, cores with no tasks to run are set to the sleep state. As the number of sleeping cores increases, the frequency must also increase to meet the deadlines of tasks that have not yet finished. In this manner, the number of cores running in a high-frequency mode is reduced, and a significant amount of energy is saved. The applications of this algorithm are limited, however, because it can be used only for framed real-time systems in which all tasks have the same deadlines and starting points. Moreover, it is a static approach; in other words, it does not take into account cases where the actual execution times may be shorter than the WCETs, which is close to the real world. If this is so, then additional energy can be saved with a dynamic approach.
`
3 SYSTEM MODEL
3.1 Task Set Model
The assumed target tasks are executed periodically, and each should be completed before its given deadline. A completed task rests in the sleep state until its next period begins, at the start of which the task will again be activated and prepared for execution. The tasks have no interdependency.

A task set T is defined by (1), where τi is the ith individual task in T. Each task has its own predefined period pi and WCET wi; the latter is defined as the maximum execution time required to complete τi at the highest possible processor frequency. The real worst-case execution time of τi thus increases from wi as the clock frequency decreases. The nearest deadline of τi at the current time is denoted di:

    T = {τ1(p1, w1), ..., τn(pn, wn)}.    (1)

The utilization ui of task τi is defined by (2); a proportion ui of the total number of cycles of a core will be dedicated to executing τi:

    ui = wi / pi.    (2)

U, the total utilization of T, is defined by (3):

    U = Σ_{∀τi ∈ T} ui.    (3)
`
The processor S consists of multiple cores and is defined in (4). The nth core in S is denoted Cn, and the number of cores in S is denoted m. Each core is assumed to have identical structure and performance. We also assume that resource sharing between the cores does not introduce any interference overhead. We have

    S = {C1, ..., Cm}.    (4)

F, the relative performance of S and the scaling factor for the operating clock frequency, is a number between 0 and 1. If the performance demand on S is F, then the actual frequency is the highest possible frequency of S multiplied by the factor F.

The system follows the partitioned scheduling approach; any well-known heuristic such as BFD, NFD, etc., may be adopted. The partitioned state of T on S is denoted P, and the partitioned task set allocated to core Cn is denoted Pn. The utilization of Pn is defined by (5):

    Un = Σ_{∀τi ∈ Pn} ui.    (5)

For ease of description and explanation, we further define the two functions given by (6) and (7): α(τi) gives the core that τi was initially partitioned into, and β(τi) gives the
`
`
core that τi is currently located in. This distinction is necessary because we will dynamically migrate tasks between the cores:

    α(τi) = Cj in which τi was initially partitioned,    (6)

    β(τi) = Cj in which τi is currently partitioned.    (7)

Each partitioned task set is scheduled using EDF on its corresponding core. The performance demand of each core is decided by running the Cycle-conserving algorithm on each core individually.

To apply the Cycle-conserving algorithm, we define some dynamically updated variables. The Cycle-conserving utilization li of task τi, which is initially equal to ui, is defined by (8). After the execution of a task, li is updated using the most recent actual execution time cci as the numerator of (2) instead of wi. After the period pi elapses, the task will be executed again and may now meet the worst-case execution conditions; the utilization of the task will thus be reset to ui. As a result, li is updated after every deadline of τi:

    li = wi / pi    if τi is unfinished,
    li = cci / pi   if τi is finished.    (8)

Ln, the dynamic utilization of core Cn, is defined by (9). Ln is the current performance demand on Cn. Thus, as long as F is greater than Ln, all the deadlines of tasks on Cn will be met by the EDF scheduling algorithm. We will also use L to refer to the Cycle-conserving utilization of a core when the context is unambiguous; thus, L of Ci also means Li:

    Ln = Σ_{∀ unfinished τi ∈ Pn} wi / pi + Σ_{∀ finished τi ∈ Pn} cci / pi.    (9)
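The per-core demand Ln of (9) reduces to a short computation. The sketch below uses our own data layout (tuples of period, WCET, actual execution time, and a finished flag), chosen only for illustration.

```python
def dynamic_utilization(tasks):
    """L_n of (9): current performance demand of one core under
    Cycle-conserving EDF.

    tasks: list of (period, wcet, actual_cc, finished) tuples, where
    actual_cc is the most recent actual execution time. Unfinished
    tasks contribute their worst case; finished tasks contribute
    only what they actually used.
    """
    return sum(
        (cc if finished else wcet) / period
        for period, wcet, cc, finished in tasks
    )

# Table 1 after t1 finishes in 2 ms: demand drops from 0.746 to 0.621.
core = [(8, 3, 2, True), (10, 3, None, False), (14, 1, None, False)]
# None is fine here: unfinished tasks never read their cc field.
print(round(dynamic_utilization(core), 3))  # 0.621
```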
`
3.2 Power Model
The total power consumption of a CMOS-based processor consists of its dynamic power Pdynamic and its leakage power Pleakage. We construct a processor model to evaluate the energy efficiency of the proposed algorithms.

Most of Pdynamic is the capacitive switching power consumed during circuit charging and discharging. Generally, it is the largest part of the processor power during instruction execution. Pdynamic can be expressed in terms of the operating voltage Vdd, the clock frequency f, and the switching capacity cl as follows [23]:

    Pdynamic = cl · V²dd · f.    (10)

The clock frequency f is itself related to several other factors, as given by (11). The threshold voltage Vth is a function of the body bias voltage Vbs, as seen in (12). Here, Vth1, α, K1, and K2 are constants depending on the processor fabrication technology. Generally, α is between 1 and 2, so raising Vdd above the threshold voltage enables the processor to increase the clock frequency. In the assumed processor model, a change of f is assumed to be accompanied by switching Vdd to the lowest allowable point:

    f = (Vdd − Vth)^α / (Ld · K6),    (11)

    Vth = Vth1 − K1 · Vdd − K2 · Vbs.    (12)
TABLE 2
Constants Based on the 70 nm Technology [24]

Variable   Value             Variable   Value
K1         0.063             Ij         4.0 × 10⁻¹⁰
K2         0.153             cl         4.3 × 10⁻¹⁰
K3         5.3 × 10⁻⁷        Ld         37
K4         1.83              Lg         4 × 10⁶
K5         4.19              α          1.5
K6         5.26 × 10⁻¹²      fmin       1 × 10⁹
Vbs        −0.7              fmax       3 × 10⁹
Vth1       0.244
`
Pleakage is caused by leakage current, which flows even while no instructions are being executed. To calculate the leakage power consumption, we adopt a processor power model used in existing research [24], [25]. Pleakage mainly consists of the subthreshold leakage current Isubn and the reverse bias junction current Ij, and can be expressed as a function of these two variables, as in (13). Isubn is defined by (14), where Lg is the number of components in the circuit, and K3, K4, and K5 are constants determined by the processor fabrication technology:

    Pleakage = Lg · (Vdd · Isubn + |Vbs| · Ij),    (13)

    Isubn = K3 · e^(K4·Vdd) · e^(K5·Vbs).    (14)

The processor cores are assumed to consume both Pdynamic and Pleakage while executing instructions, and only Pleakage during idle periods.

A multicore processor actually has some shared components as well, such as processor caches, buses, and so on. In this paper, we do not count the power consumption of these shared components because our goal is to reduce the power consumption of the cores themselves. The power consumption of a multicore processor is thus simply obtained by summing the power consumption of the individual cores.

Each core is assumed to have two operating modes: an active state, in which it is able to execute instructions, and a sleep state, in which it ceases working and rests with minimized power consumption. In the sleep state, the only possible operation is a transition into the active state. In this paper, we assume that the state transition introduces no additional overhead because this factor can be treated easily in practical implementations.

In the sleep state, it is assumed that there is no Pdynamic and that Pleakage is 3 percent of the active-state Pleakage at the current frequency f [26].

To simulate the power consumption model just described, we adopt the realistic constants [24] given in Table 2. These constants have been scaled to 70 nm technology and are based on the original values [25] for Transmeta's Crusoe processor, which uses 180 nm technology.

By adopting the constants in Table 2, we obtain the power consumption function depicted in Fig. 3. As f decreases, the ratio of Pleakage to Ptotal increases; below 1.5 GHz, Pleakage is greater than Pdynamic. Pdynamic increases rapidly with f, however, and Ptotal thus rises rapidly in the high-f domain.
`
`
[Figure: power (W) versus clock frequency (GHz), with curves for Total Power, Dynamic Power, and Leakage Power.]
Fig. 3. Power consumption of a 70 nm core as a function of clock frequency.
`
from the current time to a certain time point t according to the current schedule as Mn,t. Importing τi into Cdst can be done only when u′i + Mdst,di < 1.0.

If τi has the nearest deadline among the unfinished tasks scheduled on Csrc, exporting τi can be treated as if τi had finished at that time; Lsrc can thus be updated using (9). In this paper, we will allow the migration of a task only if it has the nearest deadline among the unfinished tasks in Psrc.

This migration is effective only for the current period. After the period, the migrated task should be returned to its original core. However, returning to the original core is only conceptual: if the condition can be met, a task that was executed on a foreign core can be exported to the same core again right after its next release, and this will appear as if the task remained there for its next period.

A migrated task may finish earlier than its worst-case execution time. This allows reducing L′dst as in (17), which is a combination of (16) and (9):

    L″dst = Ldst + (cci − cmi) / (di − T)    until di,
    L″dst = Ldst                             after di.    (17)

The migration operations can be overlapped and taken in a recursive manner. In other words, a task executed on a foreign core, having been migrated from its original core, can be exported to another core within the period as long as the conditions are met. Moreover, if a core has imported a task that is not yet completed, it can export multiple tasks to other cores as long as those tasks have the nearest deadline among the unfinished tasks in the core at the exporting time.
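The migration rules above amount to a small feasibility check. The sketch below is our reading of those conditions; the names and data shapes are ours, not the paper's.

```python
from collections import namedtuple

# Minimal task record for the feasibility check below.
Task = namedtuple("Task", "name deadline")

def can_migrate(task, src_unfinished, u_remaining, m_dst):
    """May `task` be temporarily migrated to the destination core?

    src_unfinished: unfinished tasks on the source core; the rules
        above only allow migrating the one with the nearest deadline.
    u_remaining: u'_i, the task's remaining utilization demand.
    m_dst: M_{dst,d_i}, utilization already committed on the
        destination core up to the task's deadline d_i.
    """
    nearest = min(src_unfinished, key=lambda t: t.deadline)
    if nearest is not task:
        return False                  # not the nearest-deadline task
    return u_remaining + m_dst < 1.0  # destination must stay feasible

ts = [Task("a", 5), Task("b", 9)]
ok = can_migrate(ts[0], ts, 0.3, 0.6)   # True: nearest deadline, fits
```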
We now suggest a dynamic approach to balancing the dynamic utilizations of the cores as follows.

Notion of dynamic repartitioning. Let us assume that the partitioned state P has been generated as defined in Section 3.1. We further define the variables Lmax and Lmin as the maximum and minimum L values among all cores at calling time:

    ∀Cn ∈ S, Lmax ≥ Ln,    (18)

    ∀Cn ∈ S, Lmin ≤ Ln.    (19)

For the cores Cmax and Cmin, which are the cores that have Lmax and Lmin, respectively, if ∃τk such that (τk ∈ Pmax) ∧ (u′k + Lmin < Lmax), temporarily migrating τk from Pmax to Pmin will lower the performance demand of processor S.

Our approach will replace P with P′ such that (P′max = Pmax − {τk}) ∧ (P′min = Pmin + {τk}) ∧ (∀Cn ∈ S − {Cmax, Cmin}, P′n = Pn) until dk. Under the initial partitioned state P, all deadlines are guaranteed to be met because ∀Cn ∈ S, Un < 1 by assumption. If the above conditions are met, then all deadlines will also be met under P′; as this partition is the result of migrating τk ∈ Cmax to Cmin, it still follows that ∀Cn ∈ S, Un < 1. Similarly, the repartitioned state P″ based on P′ is obtained by temporarily migrating a certain task τl from C′max to C′min (as determined on P′) until dl, and P″ also guarantees all deadlines until the deadline of a certain task τl ∈ {P″max ∪ P″min}.

If we compare Lmax (of P) to L′max (of P′), it will always be true that Lmax > L′max, at least until a new period starts for one of the tasks on C′max. All the cores except for Cmax
`
4 DYNAMIC REPARTITIONING
Because Pdynamic ∝ f³, minimizing Pdynamic in a multicore processor for a given task set is essentially the problem of generating partitioned task sets with the most balanced utilization [4]. However, even though the initial partitioned state is well balanced, the performance demand on each core changes frequently during runtime. Thus, to achieve consistently low power consumption, the performance demand of each core must stay balanced during operation. The intuitive way to resolve this temporal imbalance is migrating some
