Synthesis Techniques for Low-Power Hard Real-Time Systems on Variable Voltage Processors

Inki Hong†, Gang Qu†, Miodrag Potkonjak†, and Mani B. Srivastava‡
†Computer Science Department, University of California, Los Angeles, CA 90095-1596 USA
‡Electrical Engineering Department, University of California, Los Angeles, CA 90095-1596 USA

Abstract

The energy efficiency of systems-on-a-chip can be much improved if one were to vary the supply voltage dynamically at run time. In this paper we describe the synthesis of systems-on-a-chip based on core processors, while treating voltage (and, correspondingly, the clock frequency) as a variable to be scheduled along with the computation tasks during the static scheduling step. In addition to describing the complete synthesis design flow for these variable voltage systems, we focus on the problem of performing the voltage scheduling while taking into account the inherent limitation on the rates at which the voltage and clock frequency can be changed by the power supply controllers and clock generators. Taking these limits on the rate of change into account is crucial, since changing the voltage by even a volt may take time equivalent to 100s to 10,000s of instructions on modern processors. We present both an exact but impractical formulation of this scheduling problem as a set of non-linear equations, and a heuristic approach based on reduction to an optimally solvable restricted ordered scheduling problem. Using various task mixes drawn from a set of nine real-life applications, our results show that we are able to reduce power consumption to within 7% of the lower bound obtained by imposing no limit on the rate of change of voltage and clock frequencies.

1 Introduction

In recent years the demand for portable battery-operated computing and communication devices has made low power consumption an essential design attribute. The power reduction approaches that have emerged so far include reduction of switched capacitance, activity-based system shutdown, and aggressive supply voltage reduction via exploitation of the quadratic dependence of power on voltage, together with parallelism and pipelining to recoup lost throughput. However, aggressive supply voltage reduction [1], the most powerful of these techniques, is usable only if throughput (data sample rate) is the sole metric of speed. A single tight latency constraint, as is often present in embedded systems, renders the technique ineffective.

The problem outlined above arises because conventional systems are designed with a fixed supply voltage. However, there is no fundamental reason that the supply voltage has to be fixed. Instead, it can in principle be varied dynamically at run time. Indeed, advances in power supply technology make it possible to vary the generated supply voltage dynamically under external control. While many CMOS circuits have always been capable of operating over a range of supply voltages, it is the recent progress in power supply circuits [21, 9] that has made feasible systems with dynamically variable supply voltages. Since both the power consumed and the speed (maximal clock frequency) are a function of the supply voltage, such variable voltage systems can be made to operate at different points along their power vs. speed curves in a controlled fashion.

In particular, a static or dynamic scheduler, in addition to its conventional task of scheduling the computation operations on hardware resources, may also schedule changes in voltage as a function of timing constraints and changing system state. Such voltage scheduling would allow for much higher energy efficiency (lower power) for a wider class of applications than is possible by operating the system at one or two fixed points on the power-speed curve, as is done by the conventional approaches of supply voltage reduction to a fixed value [1] and system shutdown [25]. The benefits of the quadratic dependence of power on voltage thus become available even in event-driven systems as well as in the presence of joint latency and throughput constraints.

In this paper, we develop the first approach that combines task scheduling, instruction and data cache size determination, and processor core selection for power minimization. We establish the theoretical framework for designing variable voltage systems by treating voltage as an optimization degree of freedom for applications with real-time constraints and by providing the optimal variable voltage scheduling algorithm for some special cases, where the speed overhead for changing voltages is explicitly accounted for. We develop an effective scheduling heuristic for the general case based on the
optimal algorithm for the restricted problem. By selecting the most efficient voltage profile in the presence of multiple timing constraints under a realistic variable voltage hardware model, our algorithms result in significant energy savings.

The rest of the paper is organized in the following way. Section 2 presents the related work. Section 3 explains the necessary background. Section 4 discusses the variable voltage scheduling problem and provides an optimal solution for some special cases. The global design flow of the novel synthesis approach is presented in Section 5. In Section 6 we establish the computational complexity of the variable voltage scheduling problem and propose an effective heuristic based on the optimal algorithm for some special cases. Section 7 presents experimental data. Section 8 concludes.

2 Related work

Low power system synthesis has emerged as an important area of research in the last five years. Several good reviews of power estimation and minimization techniques exist [20, 23]. We review the research results relevant to low power systems based on dynamically variable voltage hardware.

2.1 Variable voltage system and design issues

Of direct relevance to our research is the prior research activity on technology, circuits, and techniques for variable voltage systems. At the technology level, efficient DC-DC converters that allow the output voltage to be rapidly changed under external control have recently been developed [21, 26].

At the hardware design level, research work has been performed on chips with dynamically variable supply voltage that can be adjusted based on (i) process and temperature variations, and (ii) processing load as measured by the number of data samples queued in the input (or output) buffer. Dynamically adapting the voltage (and the clock frequency) to operate at the point of lowest power consumption for given temperature and process parameters was first suggested by Macken et al. [19]. Nielsen et al. [22] extended the dynamic voltage adaptation idea to take into account data-dependent computation times in self-timed circuits. Recently, researchers at MIT [9] have extended the idea of voltage adaptation based on data-dependent computation time from [22] to synchronously clocked circuits. Because these approaches rely on run-time reactive adaptation of the voltage, they work well only where average throughput is the metric of performance. Therefore, these approaches are inapplicable to hard real-time systems such as embedded control applications with strict timing requirements.

There has been research on scheduling strategies for adjusting CPU speed so as to reduce power consumption. Most existing work is in the context of non-real-time, workstation-like environments. Weiser et al. [29] proposed an approach where time is divided into 10-50 msec intervals, and the CPU clock speed (and voltage) is adjusted by the task-level scheduler based on the processor utilization over the preceding interval. Govil et al. [8] enhanced [29] by proposing and comparing several predictive and non-predictive approaches for voltage changes, and concluded that smoothing helps more than prediction. Finally, Yao et al. [31] described an off-line minimum energy scheduling algorithm and an on-line average rate heuristic for preemptive job scheduling of independent processes with deadlines, under the assumption that the supply voltage can be changed arbitrarily without any timing or power overhead.

Hong et al. [10] proposed an approach for low power core-based real-time systems-on-chip based on dynamically variable voltage hardware. While they developed a non-preemptive variable voltage scheduling heuristic with the assumption of zero delay in changing voltage levels, in this paper we focus on preemptive variable voltage scheduling while taking into account the inherent limitation on the rates at which voltage and clock frequency can be changed by the power supply controllers and clock generators.

2.2 System level and behavioral level power estimation

Landman and Rabaey have developed macro- and micro-level power estimation techniques [16]. The work at Princeton and Fujitsu on the estimation of power consumption for programmable processors [28] is of particular relevance. Evans and Franzon [5] and Su and Despain [27] have proposed power estimation models for caches, specifically for SRAM.

2.3 System level and behavioral level power optimization

One well known technique that results in substantial power reduction is reducing the supply voltage to reduce the switching power, which is the dominant source of power dissipation in CMOS circuits and is proportional to $V_{dd}^2$, where $V_{dd}$ is the supply voltage [1]. However, this reduction comes with a speed penalty due to increased gate delays. The gate delay is proportional to $\frac{V_{dd}}{(V_{dd} - V_T)^2}$, where $V_T$ is the threshold voltage [1]. Chandrakasan et al. [1] have shown that the strategy of operating at a fixed reduced voltage can be coupled with architectural-level parallelism and pipelining to compensate for the lower clock rate due to voltage reduction, so that the overall throughput remains the same while the overall power is still lowered, although at the cost of increased latency.

Microprocessor core          Clock (MHz)  MIPS   Technology (µm)  Area (mm²)  Power diss. (mW) (Volt.)
StrongARM                    233          266    0.35             4.3         300 (1.65)
ARM 7                        40           36     0.6              5.9         200 (5)
ARM 7 Low-Power              27           24     0.6              3.8         45 (3.3)
LSI Logic TR4101             81           30     0.35             2           81 (3.3)
LSI Logic CW4001             60           53     0.5              3.5         120 (3.3)
LSI Logic CW4011             80           120    0.5              7           280 (3.3)
DSP Group Oak                80           80     0.6              8.4         190 (5)
NEC R4100                    40           40     0.35             5.4         120 (3.3)
Toshiba R3900                50           50     0.6              15          400 (3.3)
Motorola 68000               33           16     0.5              4.4         35 (3.3)
PowerPC 403                  33           41     0.5              7.5         40 (3.3)
ARM 710 (ARM7 / 8KB)         40           72     0.6              34          424 (5)
SA-110 (StrongARM / 32KB)    200          230    0.35             50          900 (1.65)

Table 1. The performance, area, and power data for a subset of processor cores.

Several researchers have addressed the issue of power in event-driven systems, and proposed various techniques for shutting down the system or parts of the system [25, 12, 4]. Compilation techniques for low power software have emerged for both general-purpose computations [28] and DSP computations [11]. Numerous behavioral synthesis research efforts have also addressed power minimization [23].

Four research groups have addressed the use of multiple (in their software implementations restricted to two or three) different voltages [2, 13, 18, 24]. They used the term "variable voltage" for a fixed number of simultaneously available voltages.

3 Preliminaries

3.1 Task model

A set T of independent tasks is to be executed on a system-on-chip. Each task $T_i \in T$ is associated with the following parameters:
  - $a_i$, its arrival time
  - $d_i$, its deadline
  - $p_i$, its period
  - $W_i$, its required number of CPU cycles
We assume, without loss of generality, that all tasks have identical periods. When this is not the case, a simple preprocessing step and an application of the least common multiple (LCM) theorem [17] transform an arbitrary set of periods to this design scenario in polynomial time. As a consequence, when the LCM theorem is applied, there may be a need to run several iterations of a given task within this overall period.

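As an illustration of this preprocessing step, the following Python sketch unrolls a task set with differing periods into a single hyperperiod equal to the LCM of the periods, replicating each task with shifted arrival times and deadlines. The Task fields and the assumption that deadlines are absolute times are illustrative choices, not the exact formulation of [17].

    from math import gcd
    from functools import reduce
    from dataclasses import dataclass

    @dataclass
    class Task:
        arrival: int    # a_i
        deadline: int   # d_i, treated here as an absolute time (an assumption)
        period: int     # p_i
        cycles: int     # W_i

    def lcm(a, b):
        return a * b // gcd(a, b)

    def unroll_to_hyperperiod(tasks):
        """Replicate each task over the hyperperiod so that all unrolled
        instances share one common (hyper)period."""
        hyper = reduce(lcm, (t.period for t in tasks))
        instances = []
        for t in tasks:
            for k in range(hyper // t.period):
                offset = k * t.period
                instances.append(Task(arrival=t.arrival + offset,
                                      deadline=t.deadline + offset,
                                      period=hyper,
                                      cycles=t.cycles))
        return hyper, instances

For example, tasks with periods 20 and 30 unroll into a hyperperiod of 60 with three and two instances, respectively.
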
3.2 Variable voltage hardware model

The variable voltage generated by the DC-DC switching regulators in the power supply cannot instantaneously make a transition from one voltage to another. For example, [21] reported transition times of 6 msec/volt for a DC-DC switching converter. In personal communications with the authors, researchers at MIT and Berkeley have reported transition times of 10 to 100 microsec/volt.

Dual to the supply voltage variation is the accompanying variation of the clock frequency. The time overhead associated with voltage switching is significant, on the order of 100s to 10,000s of instructions on a modern microprocessor. Fortunately, the computation itself can continue during the voltage and frequency change period. However, the speed overhead during the transition period must be explicitly accounted for.

As suggested in [21], we employ a linear model to describe the speed overhead. The maximum rate of voltage change is specified for the power supply regulator, and a transition can be made from one voltage to any other voltage, subject to this maximum rate.

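Under this linear model, the cost of a transition is easy to quantify. The short sketch below uses illustrative regulator and clock values (not figures from any specific part) to compute the transition time for a given voltage step and the number of instruction times it spans at a given clock rate.

    def transition_time_s(v_from, v_to, slew_volts_per_s):
        """Time for the regulator to ramp between two voltages under a
        linear (constant-slew) model."""
        return abs(v_to - v_from) / slew_volts_per_s

    def transition_cost_in_instructions(v_from, v_to, slew_volts_per_s, clock_hz):
        """Rough number of single-cycle instruction times spanned by the
        transition (the processor keeps running, but at a changing speed)."""
        return transition_time_s(v_from, v_to, slew_volts_per_s) * clock_hz

    # Example: a 1 V step at 10 microsec/volt (1e5 V/s) on a 100 MHz core spans
    # roughly 1e-5 s * 1e8 Hz = 1,000 instruction times, consistent with the
    # 100s-to-10,000s range quoted above.
    if __name__ == "__main__":
        print(transition_cost_in_instructions(2.3, 3.3, 1e5, 100e6))
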
3.3 Power model

It is well known that there are three principal components of power consumption in CMOS integrated circuits: switching power, short-circuit power, and leakage power. The switching power, which dominates power consumption, is given by $P = C_L V_{dd}^2 f_{clock}$, as indicated earlier, where $C_L$ is the effective switched capacitance. It is also known that reduced voltage operation comes at the cost of reduced throughput [1]. The gate delay $T$ follows the formula $T = k \frac{V_{dd}}{(V_{dd} - V_T)^2}$, where $k$ is a constant [1] and $V_T$ is the threshold voltage. From these equations, together with the observation that the speed is proportional to $f_{clock}$ and inversely proportional to the gate delay, the power vs. speed curve can be derived. In particular, the normalized power $P_n$ (= 1 at $V_{dd} = V_{ref}$) can be expressed as a function of the normalized speed $S_n$ (= 1 at $V_{dd} = V_{ref}$) by eliminating $V_{dd}$ from
$$P_n = \left(\frac{V_{dd}}{V_{ref}}\right)^2 S_n \quad \text{and} \quad S_n = \frac{V_{ref}}{(V_{ref} - V_T)^2} \cdot \frac{(V_{dd} - V_T)^2}{V_{dd}},$$
which yields a closed-form power vs. speed curve; we evaluate it assuming $V_{ref}$ = 3.3 volts and $V_T$ = 0.8 volts. By varying $V_{dd}$, the system can be made to operate at different points along this curve. From the equation, it is easy to see that the power is a convex function of speed.
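For concreteness, the sketch below evaluates this power vs. speed relation numerically for $V_{ref}$ = 3.3 V and $V_T$ = 0.8 V. It solves the speed relation for $V_{dd}$ (a quadratic in $V_{dd}$) and then computes $P_n = (V_{dd}/V_{ref})^2 S_n$; this is a sketch of the model just described, not of the exact closed-form expression printed in the original.

    import math

    V_REF = 3.3  # volts, reference supply
    V_T = 0.8    # volts, threshold voltage

    def vdd_for_speed(s_n, v_ref=V_REF, v_t=V_T):
        """Solve s_n = (v_ref / (v_ref - v_t)^2) * (vdd - v_t)^2 / vdd for vdd.
        Rearranged: vdd^2 - (2*v_t + c*s_n)*vdd + v_t^2 = 0,
        with c = (v_ref - v_t)^2 / v_ref; the larger root is the physical one."""
        c = (v_ref - v_t) ** 2 / v_ref
        b = 2 * v_t + c * s_n
        return (b + math.sqrt(b * b - 4 * v_t * v_t)) / 2

    def normalized_power(s_n):
        """Normalized power P_n as a function of normalized speed S_n."""
        vdd = vdd_for_speed(s_n)
        return (vdd / V_REF) ** 2 * s_n

    if __name__ == "__main__":
        for s in (0.25, 0.5, 0.75, 1.0):
            print(s, round(normalized_power(s), 3))   # P_n(1.0) comes out as 1.0
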
3.4 Target architecture

Several factors combine to influence system performance: instruction and data cache miss rates and penalties, processor performance, and system clock speed. Power dissipation of the system is estimated using the processor power dissipation per instruction and the number of executed instructions per task, the supply voltage, the energy required for cache reads, writes, and off-chip data accesses, as well as the profiling information about the number of cache and off-chip accesses.

Data on microprocessor cores have been extracted from manufacturers' datasheets as well as from the CPU Center Info web site [3]. A sample of the collected data is presented in Table 1. The last two rows of the table show two integrated microprocessor products with on-chip caches and their system performance data. We assume that the processor cores follow the power model described in the preceding subsection.

                                    Line size
Cache size   8B     16B    32B    64B    128B   256B   512B
512B         5.99   6.02   -      -      -      -      -
1KB          6.07   6.13   6.07   6.23   -      -      -
2KB          6.44   6.52   6.36   6.34   6.51   -      -
4KB          6.88   7.02   6.66   6.49   6.56   7.35   -
8KB          7.67   7.81   7.21   6.99   6.91   7.65   9.40
16KB         8.34   8.58   8.00   7.62   7.54   8.14   9.80
32KB         9.30   9.45   8.91   8.59   8.69   9.30   10.04

Table 2. A subset of the cache latency model: minimal cycle time (ns) for various direct-mapped caches (rows: cache size) with variable line sizes (columns).

We use CACTI [30] as a cache delay estimation tool with respect to the main cache design choices: size, associativity, and line size. The energy model has been adopted from [5] and [27]. The overall cache model has been scaled with respect to actual industrial implementations. Caches typically found in current embedded systems range from 128B to 32KB. Although larger caches correspond to higher hit rates, their power consumption is proportionally higher, resulting in an interesting design trade-off. Higher cache associativity results in significantly higher access time. We use a recently developed compiler strategy for direct-mapped caches which efficiently minimizes the number of cache conflicts [14]. We have experimented with 2-way set-associative caches, which did not dominate comparable direct-mapped caches in even a single case. Cache line size was also variable in our experiments. Its variation corresponds to the following trade-off: a larger line size results in a higher cache miss penalty and higher power consumption by the sense amplifiers and column decoders, while a smaller line size results in larger cache decoder power consumption. Extreme values result in significantly increased access time. We estimated the cache miss penalty based on the operating frequency of the system and the external bus width and clock for each system investigated. This penalty ranged between 4 and 20 system clock cycles. Write-back was adopted as opposed to write-through, since it has been shown to provide superior performance and especially power savings in uniprocessor systems, at increased hardware cost [15]. Each of the processors considered is able to issue at most a single instruction per clock period. Thus, caches were designed to have a single access port. A subset of the cache model data is given in Tables 2 and 3. The cache access delay and power consumption models were computed for a number of organizations and sizes, assuming a feature size of 0.5 µm and a typical six-transistor CMOS SRAM cell implementation. The nominal energy consumption per single off-chip memory access is adopted from [6].

             No optimizations           Block buffering, sub-banking,
                                        and Gray code addressing [27]
Cache size   8B      16B     32B        8B      16B     32B
512B         0.330   0.378   0.468      0.291   0.322   0.401
1KB          0.356   0.394   0.463      0.295   0.299   0.345
2KB          0.422   0.444   0.489      0.317   0.271   0.271
4KB          0.651   0.666   0.694      0.456   0.334   0.273
8KB          1.146   1.156   1.175      0.769   0.504   0.347
16KB         2.158   2.164   2.174      1.412   0.869   0.530
32KB         4.198   4.202   4.209      2.702   1.608   0.922

Table 3. A subset of the cache power consumption model: power consumption (nJ) estimation per access for various direct-mapped caches (rows: cache size) with variable line sizes (columns).
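Putting the cache and off-chip models together, the per-task memory energy used by the estimator can be approximated as in the sketch below; the function, its arguments, and the example numbers (per-access energies in nJ in the spirit of Table 3, a hypothetical off-chip energy per word, and a 4-word line) are illustrative assumptions, not the paper's exact estimation code.

    def memory_energy_nj(accesses, miss_rate, cache_energy_nj,
                         offchip_energy_nj, line_words=4):
        """Rough memory-subsystem energy for one task: every access pays the
        cache access energy; each miss additionally pays for fetching a full
        line from off-chip memory."""
        misses = accesses * miss_rate
        cache_part = accesses * cache_energy_nj
        offchip_part = misses * line_words * offchip_energy_nj
        return cache_part + offchip_part

    # Example: 1e6 accesses, 2% miss rate, 0.3 nJ per cache access,
    # a hypothetical 10 nJ per off-chip word.
    if __name__ == "__main__":
        print(memory_energy_nj(1_000_000, 0.02, 0.3, 10.0))
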
4 Theory of variable voltage scheduling: special solvable cases

Yao, Demers, and Shenker [31] have provided the optimal preemptive static scheduling algorithm for a set of independent tasks with arbitrary arrival times and deadlines. The algorithm is based on the concept of critical regions, and the tasks in critical regions are scheduled by the earliest deadline first (EDF) algorithm. The running time of the algorithm is $O(n \log n)$ for $n$ tasks. When there is a limit on the maximum voltage change rate, the problem becomes NP-complete, which is shown by a reduction from the SEQUENCING WITH DEADLINES AND SETUP TIMES problem on page 238 of [7]. In this section, we discuss some special cases which can be solved optimally.

We seek the speed function $S(t)$ of the processor that minimizes the energy consumption during the time period $[t_1, t_2]$. The voltage function $V(t)$ can be easily obtained from $S(t)$ [1]. We assume that the speed of the processor can be changed between $S_{min}$ and $S_{max}$ with a maximum change rate $K$, i.e., at any time $t$, the speed $S(t)$ satisfies:

$$S_{min} \le S(t) \le S_{max} \qquad (1)$$
$$|S'(t)| \le K \qquad (2)$$

The amount of work that the processor completes during the time interval $[t_1, t_2]$ is given by:

$$W = \int_{t_1}^{t_2} S(t)\, dt \qquad (3)$$

The power, or energy consumed per unit time, is a convex function $P(S)$ of the processor's speed.
The function $P(S)$ depends on the technology considered. The total energy consumed during the time interval $[t_1, t_2]$ is:

$$E = \int_{t_1}^{t_2} P(S(t))\, dt \qquad (4)$$

Lemma 1. Starting with speed $S_0$, the work $W$ that the processor can complete in the time interval $[t_1, t_2]$ falls into the range $[W_{min}, W_{max}]$, with
$$W_{min} = S_c (t_2 - t_1) + \frac{(S_0 - S_c)^2}{2K}, \qquad W_{max} = S_d (t_2 - t_1) - \frac{(S_d - S_0)^2}{2K},$$
where $S_c = \max(S_{min},\, S_0 - K(t_2 - t_1))$ and $S_d = \min(S_{max},\, S_0 + K(t_2 - t_1))$.

[Figure 1. Illustration of $W_{min}$ and $W_{max}$ for Lemma 1: (a), (b) the speed profiles achieving $W_{min}$, depending on whether $S_0 - K(t_2 - t_1)$ reaches $S_{min}$; (c), (d) the speed profiles achieving $W_{max}$, depending on whether $S_0 + K(t_2 - t_1)$ reaches $S_{max}$.]

Since power is a convex function of the processor's speed, it is obvious that, if we can start from any speed, then to complete a given workload $W$ in any time interval of fixed length $\Delta t$ we should run the processor at the constant speed $W / \Delta t$. When we do not have the freedom to choose the starting speed, the following theorem shows that the best we can do is to change the speed as fast and as early as possible until we reach a critical point, and then keep the speed constant.

Theorem 1. With the same constraints as in Lemma 1, for any workload $W \in [W_{min}, W_{max}]$, there is a unique speed function $S: [t_1, t_2] \to [S_{min}, S_{max}]$ such that (1), (2), and (3) are all satisfied and the energy (4) is minimized. Moreover, $S(t)$ is defined as follows.

If $W_{min} \le W \le S_0 (t_2 - t_1)$:
$$S(t) = \begin{cases} S_0 - K(t - t_1), & t_1 \le t \le t_c \\ S_0 - K(t_c - t_1), & t_c < t \le t_2 \end{cases}$$
where $t_c = t_2 - \sqrt{(t_2 - t_1)^2 + \frac{2}{K}\,(W - S_0 (t_2 - t_1))}$.

If $S_0 (t_2 - t_1) \le W \le W_{max}$:
$$S(t) = \begin{cases} S_0 + K(t - t_1), & t_1 \le t \le t_c \\ S_0 + K(t_c - t_1), & t_c < t \le t_2 \end{cases}$$
where $t_c = t_2 - \sqrt{(t_2 - t_1)^2 - \frac{2}{K}\,(W - S_0 (t_2 - t_1))}$.
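The Python sketch below is an illustration of Lemma 1 and Theorem 1 for a single interval, assuming the model above: it computes the feasible work range and the "ramp at the maximum rate, then hold" profile; the function names and the example numbers are our own.

    import math

    def work_range(s0, t1, t2, k, s_min, s_max):
        """Lemma 1: range [w_min, w_max] of work completable in [t1, t2]
        starting at speed s0 with slew limit k."""
        dt = t2 - t1
        sc = max(s_min, s0 - k * dt)
        sd = min(s_max, s0 + k * dt)
        w_min = sc * dt + (s0 - sc) ** 2 / (2 * k)
        w_max = sd * dt - (sd - s0) ** 2 / (2 * k)
        return w_min, w_max

    def optimal_profile(s0, t1, t2, k, w):
        """Theorem 1: ramp at the maximum rate K (up or down) until the
        critical time tc, then hold a constant speed until t2."""
        dt = t2 - t1
        direction = 1.0 if w >= s0 * dt else -1.0      # accelerate or decelerate
        tc = t2 - math.sqrt(dt * dt - (2.0 / k) * abs(w - s0 * dt))
        hold = s0 + direction * k * (tc - t1)          # speed held after tc

        def speed(t):
            if t <= tc:
                return s0 + direction * k * (t - t1)
            return hold
        return tc, speed

    if __name__ == "__main__":
        w_lo, w_hi = work_range(2.0, 0.0, 1.0, 4.0, 0.5, 5.0)
        tc, s = optimal_profile(2.0, 0.0, 1.0, 4.0, 2.5)   # 2.5 lies in [w_lo, w_hi]
        print(round(w_lo, 3), round(w_hi, 3), round(tc, 3), s(0.0), round(s(1.0), 3))
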
Next we consider the case when there is a finishing speed constraint. Similarly, we have the following results.

Lemma 2. Starting with speed $S_0$, to reach the finishing speed $S_1$ at the end of the time interval $[t_1, t_2]$, the work $W$ that the processor has to complete falls into the range $[W_{min}, W_{max}]$, with
$$W_{min} = S_c (t_2 - t_1) + \frac{(S_0 - S_c)^2}{2K} + \frac{(S_1 - S_c)^2}{2K},$$
$$W_{max} = S_d (t_2 - t_1) - \frac{(S_d - S_0)^2}{2K} - \frac{(S_d - S_1)^2}{2K},$$
where $S_c = \max\!\left(S_{min},\, \frac{S_0 + S_1}{2} - \frac{K (t_2 - t_1)}{2}\right)$ and $S_d = \min\!\left(S_{max},\, \frac{S_0 + S_1}{2} + \frac{K (t_2 - t_1)}{2}\right)$.

Theorem 2. With the same constraints as in Lemma 2, for any workload $W \in [W_{min}, W_{max}]$, there is a unique speed function $S: [t_1, t_2] \to [S_{min}, S_{max}]$ such that (1), (2), and (3) are satisfied and the energy (4) is minimized. The speed function is piecewise linear and satisfies the following:

if $W_{min} \le W \le W_1$,
$$S(t) = \begin{cases} S_0 - K(t - t_1), & t_1 \le t \le x_1 \\ S_0 - K(x_1 - t_1), & x_1 < t \le x_2 \\ S_1 - K(t_2 - t), & x_2 < t \le t_2 \end{cases}$$

if $W_1 \le W \le W_2$,
$$S(t) = \begin{cases} S_0 + K(t - t_1), & t_1 \le t \le x_1 \\ S_0 + K(x_1 - t_1), & x_1 < t \le x_2 \\ S_1 - K(t_2 - t), & x_2 < t \le t_2 \end{cases}$$

and if $W_2 \le W \le W_{max}$,
$$S(t) = \begin{cases} S_0 + K(t - t_1), & t_1 \le t \le x_1 \\ S_0 + K(x_1 - t_1), & x_1 < t \le x_2 \\ S_1 + K(t_2 - t), & x_2 < t \le t_2 \end{cases}$$

where $W_1$, $W_2$, $x_1$, $x_2$ are constants.

From Theorem 2, it is simple to see that the speed functions which minimize energy have only a few restricted shapes. For example, if $S_0 \le S_1$, then $S(t)$ has to take one of the seven shapes shown in Figure 2.

[Figure 2. All possible shapes of an optimal speed function when $S_0 \le S_1$: panels (a)-(g) show the seven admissible ramp/hold combinations between the starting speed $S_0$ and the finishing speed $S_1$.]

We consider the following scheduling problem.

Ordered Scheduling Problem (OSP): Given a partition $0 = t_0 < t_1 < t_2 < \dots < t_{n-1} < t_n = t$ of the time interval $[0, t]$ and a workload $W_i$ to be completed during each interval $[t_{i-1}, t_i]$, we want to find the speed function $S: [0, t] \to [S_{min}, S_{max}]$ such that:
(i) $S(0) = S_0$,
(ii) $S_{min} \le S(t) \le S_{max}$,
(iii) $|S'(t)| \le K$,
(iv) $W_i = \int_{t_{i-1}}^{t_i} S(t)\, dt$, and
(v) $E = \int_0^t P(S(t))\, dt$ is minimized.

For this problem, we can use the following approach. Let $S(t)$ be the speed function that we are looking for and $x_i = S(t_i)$ for $i = 1, 2, \dots, n$. Using Theorem 2,
we can express $S(t)$ as a function of the unknown variables $x_1, x_2, \dots, x_n$. We can then plug this expression into (v) and obtain a system of (non-linear) equations in the $x_i$'s from the first-order condition for minimizing the energy function (v). The system of (non-linear) equations is solved by some numerical method if a closed-form solution cannot be found. The speed function $S(t)$ is then determined by the formulae in Theorem 2 and the solution(s) of the system of (non-linear) equations.

There are a few comments to be made about this approach. First, the approach is impractical, since the (non-linear) system of equations is very difficult to solve in many cases. Secondly, when applying Theorem 2 in this approach, we have to check an additional condition, $W_{min}(x_{i-1}, x_i) \le W_i \le W_{max}(x_{i-1}, x_i)$, which makes the problem even more difficult. Thirdly, since the solutions of the (non-linear) system of equations are only candidates for our minimization problem, we have to apply (at least) the second-order condition to check them. Finally, there are still further mathematical details to take into consideration, e.g., the continuity and differentiability of the power function. Therefore, good heuristics are more useful in practice.

We note that the following special case of the OSP problem can be solved optimally.

Restricted Ordered Scheduling Problem (ROSP): With the same constraints as in the OSP problem, $S(t_i)$ for $i = 1, 2, \dots, n$ are additionally given.

It is trivial to see that Theorem 2 provides an optimal speed function for the ROSP problem.
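To make the numerical route for OSP concrete, the sketch below (an illustration, not the paper's formulation) discretizes the speed profile on a fine time grid and asks a general-purpose solver to minimize total energy subject to the speed bounds, the slew-rate limit, and the per-interval work constraints. It assumes scipy.optimize.minimize with the SLSQP method is available, and uses a cubic placeholder for the convex power model.

    import numpy as np
    from scipy.optimize import minimize

    def solve_osp(breaks, workloads, s0, s_min, s_max, k,
                  power=lambda s: s ** 3, steps_per_interval=20):
        """Discretized OSP: minimize total energy over a sampled speed profile.
        `power` is any convex power-vs-speed model (cubic placeholder here)."""
        grid = np.concatenate(
            [np.linspace(breaks[i], breaks[i + 1], steps_per_interval,
                         endpoint=False) for i in range(len(breaks) - 1)]
            + [np.array([breaks[-1]])])
        dt = np.diff(grid)

        def energy(s):
            mid = 0.5 * (s[:-1] + s[1:])          # trapezoidal speed per segment
            return float(np.sum(power(mid) * dt))

        cons = [{"type": "eq", "fun": lambda s: s[0] - s0}]
        for i in range(len(breaks) - 1):          # per-interval work constraints
            lo, hi = i * steps_per_interval, (i + 1) * steps_per_interval
            def work_gap(s, lo=lo, hi=hi, w=workloads[i]):
                mid = 0.5 * (s[lo:hi] + s[lo + 1:hi + 1])
                return float(np.sum(mid * dt[lo:hi]) - w)
            cons.append({"type": "eq", "fun": work_gap})
        cons.append({"type": "ineq",              # slew-rate limit |ds/dt| <= k
                     "fun": lambda s: k * dt - np.abs(np.diff(s))})

        x0 = np.full(grid.size, np.clip(s0, s_min, s_max))
        res = minimize(energy, x0, method="SLSQP",
                       bounds=[(s_min, s_max)] * grid.size, constraints=cons)
        return grid, res.x, res.fun

    # Example: two unit-length intervals with workloads 1.0 and 2.0,
    # starting speed 1.0, slew limit 3.0.
    if __name__ == "__main__":
        t, s, e = solve_osp([0.0, 1.0, 2.0], [1.0, 2.0], 1.0, 0.1, 4.0, 3.0)
        print(round(e, 3))
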
5 Global design flow

In this section we describe the global flow of the proposed synthesis system and explain the function of each subtask and how these subtasks are combined into a synthesis system.

[Figure 3. The global flow of the synthesis approach: search for the minimum power consumption configuration of processor, I-cache, and D-cache over all applications; change of the I-cache/D-cache configuration; system performance and power evaluation with simulation platforms; performance and power optimization heuristics for mapping basic blocks to cache lines; system performance and power consumption estimation; and variable voltage task scheduling.]

Figure 3 illustrates the synthesis system. The goal is to choose the configuration of processor, I-cache, and D-cache and the variable voltage task schedule with minimum power consumption which satisfy the timing requirements of multiple preemptive tasks. To accurately predict the system performance and power consumption for the target applications, we employ an approach which integrates optimization, simulation, modeling, and profiling tools, as shown in Figure 5. The synthesis technique considers each non-dominated microprocessor core and competitive cache configuration, and selects the hardware setup which requires minimal power consumption and satisfies the individual performance requirements of all target applications. The application-driven search for a low-power core and cache system requires the use of trace-driven cache simulation for each promising point considered in the design space. We attack this problem by carefully scanning the design space using search algorithms with sharp bounds and by providing powerful algorithmic performance and power estimation techniques. The search algorithm to find an energy-efficient system configuration is described by the pseudo-code shown in Figure 4.

    Sort the processor cores in a list L in increasing order of
        EnergyPerInstruction / SystemClockFrequency at the nominal voltage 5V
    Delete the dominated processor cores from L
    For each processor core in L, in order of appearance
        For I-cache = 512B..32KB and CacheLineSize = 8B..512B
            For D-cache = 512B..32KB and CacheLineSize = 8B..512B
                Check bounds; if exceeded, break
                If the current I- and D-cache configuration has never been evaluated
                    Evaluate its performance and power consumption
                Using the cache system analysis, evaluate the power consumption
                    of the entire system using variable voltage task scheduling
                Memorize configuration C if its power consumption is minimal

Figure 4. Pseudo-code for the search for a power-efficient system configuration.
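A runnable rendering of this search skeleton is sketched below in Python. The callables evaluate_cache, system_power, and check_bounds are placeholders for the trace-driven cache simulation, the whole-system power estimation with variable voltage scheduling, and the bound checks described in the text, and the core record fields are illustrative names; the dominance test follows the definition given later in this section.

    from itertools import product

    CACHE_SIZES = [512, 1024, 2048, 4096, 8192, 16384, 32768]   # bytes
    LINE_SIZES = [8, 16, 32, 64, 128, 256, 512]                 # bytes

    def dominates(a, b):
        """Core a dominates core b: less power at a higher clock and higher
        MIPS at the same nominal supply (see text below)."""
        return (a["power_mw"] < b["power_mw"] and a["clock_mhz"] > b["clock_mhz"]
                and a["mips"] > b["mips"] and a["volt"] == b["volt"])

    def search(cores, evaluate_cache, system_power, check_bounds):
        """Skeleton of the Figure 4 search over cores and cache configurations."""
        live = [c for c in cores if not any(dominates(o, c) for o in cores)]
        live.sort(key=lambda c: c["energy_per_instr"] / c["clock_mhz"])

        cache_memo = {}                       # each cache config simulated once
        best, best_power = None, float("inf")
        for core in live:
            for icfg, dcfg in product(product(CACHE_SIZES, LINE_SIZES), repeat=2):
                if check_bounds(core, icfg, dcfg, best_power):
                    break                     # bound exceeded, prune this core
                for cfg in (icfg, dcfg):
                    if cfg not in cache_memo:
                        cache_memo[cfg] = evaluate_cache(cfg)
                power = system_power(core, cache_memo[icfg], cache_memo[dcfg])
                if power < best_power:
                    best, best_power = (core, icfg, dcfg), power
        return best, best_power
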
Since performance and power evaluation of a single processor, I-cache, and D-cache configuration requires a time-consuming trace-driven simulation, the goal of our search technique is to reduce the number of evaluated cache systems using sharp bounds for cache system performance and power estimation. Moreover, a particular cache system is evaluated using trace-driven simulation only once, since the data retrieved from such a simulation can be used for overall system power consumption estimation for different embedded processor cores with minor additional computational expense.

The algorithm excludes from further consideration processor cores dominated by other processor cores. One processor type dominates another if it consumes less power at a higher frequency and delivers higher MIPS performance at the same nominal power supply. The competitive processors are then sorted in ascending order with respect to the ratio of their power consumption per instruction to their frequency. Microprocessors which appear to be more power-efficient are therefore given priority in the search process. This step later provides sharper bounds for search termination. The search for the most efficient cache configuration is bounded
with sharp bounds. One bound is determined by measuring the number of conflict misses and comparing the energy required to fetch the data from off-chip memory due to the measured conflict misses with the power that would be consumed by a twice larger cache for the same number of cache accesses, assuming zero cache conflicts. We terminate further increases of the cache size when the power consumption of the larger cache would exceed the energy consumed by the current best solution. Similarly, another bound is defined at the point when the energy required to fetch the data from off-chip memory due to conflict cache misses for a twice smaller cache, under the assumption of zero energy consumption per cache access, is larger than the energy required for both fetching data from the cache and from off-chip memory in the case of the current cache structure. We abort further decreases of the cache size if the amount of energy required to bring the data from off-chip memory due to the additional cache misses is larger than the energy consumed by the current best solution.
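The two pruning bounds can be phrased as simple energy comparisons. The sketch below is one hedged reading of the description above (the energy terms and their granularity are assumptions for illustration, not the paper's exact accounting): growing the cache stops when even a conflict-free cache of twice the size would spend more energy on its accesses than the current best solution, and shrinking stops when the extra off-chip traffic of a half-sized cache already exceeds the total memory energy of the current configuration.

    def stop_growing(accesses, energy_per_access_2x_nj, best_total_energy_nj):
        """Upper bound: a twice-larger, conflict-free cache would already spend
        more energy on cache accesses alone than the best solution so far."""
        return accesses * energy_per_access_2x_nj > best_total_energy_nj

    def stop_shrinking(extra_misses_half, energy_per_offchip_access_nj,
                       current_total_energy_nj):
        """Lower bound: the off-chip energy caused by the additional conflict
        misses of a half-sized cache (counting cache accesses as free) already
        exceeds the cache-plus-off-chip energy of the current cache."""
        return (extra_misses_half * energy_per_offchip_access_nj
                > current_total_energy_nj)
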
When evaluating competitive hardware configurations, the target applications are scheduled with the variable voltage task scheduler using the predicted system performance and power for the configuration considered.

[Figure 5. The integrated optimization, simulation, modeling, and profiling tool flow (only the label "Control Flow Graph ..." is recoverable from the figure).]

Our profiling and simulation environment is based on SHADE and DINERO III [14, 15]. SHADE is a tracing tool which allows users to define custom trace analyzers and thus collect rich information on runtime events. SHADE currently profiles only SPARC binaries. The executable binary program is dynamically translated into host machine code. The tool also provides a stream of data to the translated code, which is directly executed to simulate and trace the original application code. A custom analyzer, composed of approximately 2,500 lines of C code, is linked to SHADE to control and analyze the generated trace information. The analyzer obtains the relevant trace information from SHADE and builds a control flow graph (CFG) corresponding to the dynamically executed code. The analysis consists of two passes. The first pass determines the boundaries of basic blocks, while the second pass constructs the CFG by adding control flow information between basic blocks. We also collect the frequencies of control transfers through each
