Synthesis Techniques for Low-Power Hard Real-Time Systems on Variable Voltage Processors
`
Inki Hong†, Gang Qu†, Miodrag Potkonjak†, and Mani B. Srivastava‡
†Computer Science Department, University of California, Los Angeles, CA 90095-1596 USA
‡Electrical Engineering Department, University of California, Los Angeles, CA 90095-1596 USA
`
`Abstract
`
The energy efficiency of systems-on-a-chip can be improved significantly by varying the supply voltage dynamically at run time. In this paper we describe the synthesis
`of systems-on-a-chip based on core processors, while treat-
`ing voltage (and correspondingly, the clock frequency) as a
`variable to be scheduled along with the computation tasks
`during the static scheduling step. In addition to describ-
`ing the complete synthesis design flow for these variable
`voltage systems, we focus on the problem of doing the volt-
`age scheduling while taking into account the inherent limi-
`tation on the rates at which the voltage and clock frequency
`can be changed by the power supply controllers and clock
`generators. Taking these limits on rate of change into ac-
`count is crucial since changing the voltage by even a volt
`may take time equivalent to 100s to 10,000s of instructions
`on modern processors. We present both an exact but im-
`practical formulation of this scheduling problem as a set
`of non-linear equations, as well as a heuristic approach
`based on reduction to an optimally solvable restricted or-
`dered scheduling problem. Using various task mixes drawn
`from a set of nine real-life applications, our results show
that we are able to reduce power consumption to within 7% of the lower bound obtained by imposing no limit on the rate of change of voltage and clock frequencies.
`
`1 Introduction
`
`In recent years the demand for portable battery-operated
`computing and communication devices has made low power
`consumption an essential design attribute. The power re-
`duction approaches that have emerged so far include reduc-
`tion of switched capacitance, activity based system shut-
`down, and aggressive supply voltage reduction via exploita-
`tion of quadratic dependence of power on voltage together
`with parallelism and pipelining to recoup lost throughput.
However, aggressive supply voltage reduction [1], the most
`powerful of these techniques, is usable only if throughput
`(data sample rate) is the sole metric of speed. A single tight
`
`latency constraint, as is often present in embedded systems,
`renders the technique ineffective.
`The problem outlined above really arises because con-
`ventional systems are designed with a fixed supply voltage.
`However, there is no fundamental reason that the supply
`voltage has to be fixed. Instead, it can in principle be varied
`dynamically at run time. Indeed, advances in power supply
technology make it possible to vary the generated supply
`voltage dynamically under external control. While many
`CMOS circuits have always been capable of operating over
`a range of supply voltages, it is the recent progress in power
`supply circuits [21, 9] that has made feasible systems with
`dynamically variable supply voltages. Since both the power
`consumed and the speed (maximal clock frequency) are a
`function of the supply voltage, such variable voltage sys-
`tems can be made to operate at different points along their
`power vs. speed curves in a controlled fashion.
In particular, a static or dynamic scheduler, in addition
`to its conventional task of scheduling the computation op-
`erations on hardware resources, may also schedule changes
`in voltage as a function of timing constraints and chang-
`ing system state. Such voltage scheduling would allow for
`much higher energy efficiency (lower power) for a wider
`class of applications than is possible by operating the sys-
`tem at one or two fixed points on the power-speed curve,
`as is done by conventional approaches of supply voltage re-
`duction to a fixed value [1] and system shutdown [25]. The
benefits of the quadratic dependence of power on voltage thus become available even in event-driven systems, as well as in the presence of joint latency and throughput constraints.
`
In this paper, we develop the first approach for power minimization that combines task scheduling, instruction and data cache size determination, and processor core selection. We establish
`the theoretical framework for designing variable voltage sys-
`tems by treating voltage as an optimization degree of free-
`dom for the applications with real-time constraints and pro-
`viding the optimal variable voltage scheduling algorithm for
`some special cases, where the speed overhead for chang-
`ing voltages is explicitly accounted for. We develop an ef-
`fective scheduling heuristic for a general case based on the
`
`MICROCHIP TECHNOLOGY INC. EXHIBIT 1018
`Page 1 of 10
`
`
`
`optimal algorithm for the restricted problem. By selecting
`the most efficient voltage profile in the presence of multiple
`timing constraints under the realistic variable voltage hard-
`ware model, our algorithms result in significant savings in
`energy.
`The rest of the paper is organized in the following way.
`Section 2 presents the related work. Section 3 explains
`the necessary background. Section 4 discusses the variable
`voltage scheduling problem and provides an optimal solu-
`tion to some special cases. The global design flow of the
`novel synthesis approach is presented in Section 5. In Sec-
`tion 6 we establish the computational complexities of the
`variable voltage scheduling problem and propose an effec-
`tive heuristic based on the optimal algorithm for some spe-
`cial cases. Section 7 presents experimental data. Section 8
`concludes.
`
`2 Related work
`
`Low power system synthesis has emerged as an impor-
tant area of research in the last five years. Several good reviews
`on power estimation and minimization techniques exist [20,
`23]. We review the research results relevant to low power
`systems based on dynamically variable voltage hardware.
`
`2.1 Variable voltage system and design issues
`
`Of direct relevance to our research is the prior research
`activity on technology, circuits, and techniques for vari-
`able voltage systems. At the technology level, efficient DC-
`DC converters that allow the output voltage to be rapidly
`changed under external control have recently been devel-
`oped [21, 26].
`At the hardware design level, research work has been
`performed on chips with dynamically variable supply volt-
`age that can be adjusted based on (i) process and temper-
`ature variations, and (ii) processing load as measured by
`the number of data samples queued in the input (or output)
`buffer. Dynamically adapting voltage (and the clock fre-
`quency), to operate at the point of lowest power consump-
`tion for given temperature and process parameters was first
suggested by Macken et al. [19]. Nielsen et al. [22] ex-
`tended the dynamic voltage adaptation idea to take into ac-
`count data dependent computation times in self-timed cir-
`cuits. Recently, researchers at MIT [9] have extended the
`idea of voltage adaptation based on data dependent compu-
`tation time from [22] to synchronously clocked circuits. Be-
cause these approaches rely on reactive run-time mechanisms for dynamic voltage adaptation, they work well only when average throughput is the metric of performance. Therefore, these approaches are inapplicable to hard real-time systems such as embedded control applications with strict timing requirements.
`
`There has been research on scheduling strategies for ad-
`justing CPU speed so as to reduce power consumption. Most
`existing work is in the context of non-real-time workstation-
`like environment. Weiser et al. [29] proposed an approach
`where time is divided into 10-50 msec intervals, and the
`CPU clock speed (and voltage) is adjusted by the task-level
`scheduler based on the processor utilization over the pre-
`ceding interval. Govil et al. [8] enhanced [29] by propos-
`ing and comparing several predictive and non-predictive ap-
`proaches for voltage changes, and concluded that smooth-
`ing helps more than prediction. Finally, Yao et al. [31] de-
`scribed an off-line minimum energy scheduling algorithm
`and an on-line average rate heuristic for job scheduling with
`preemption for independent processes with deadlines under
`the assumption that the supply voltage can be changed arbi-
`trarily without any timing and power overhead.
Hong et al. [10] proposed an approach for low-power core-based real-time systems-on-chip based on dynamically variable voltage hardware. While they developed a non-preemptive variable voltage scheduling heuristic under the assumption of zero delay in changing voltage levels, in this
`paper we focus on the preemptive variable voltage schedul-
`ing while taking into account the inherent limitation on the
`rates at which voltage and clock frequency can be changed
`by the power supply controllers and clock generators.
`
`2.2 System level and behavioral level power esti-
`mation
`
Landman and Rabaey have developed techniques for macro- and micro-level power estimation [16]. The work at Princeton and Fujitsu on the estimation of power consumption for programmable processors [28] is of particular relevance. Evans and Franzon [5] and Su and Despain [27] have proposed power estimation models for caches, specifically for SRAM.
`
`2.3 System level and behavioral level power opti-
`mization
`
One well-known technique that results in substantial power reduction is reducing the voltage to reduce the switching power, which is the dominant source of power dissipation in CMOS circuits and is proportional to Vdd^2, where Vdd is the supply voltage [1]. However, this reduction comes with a speed penalty due to increased gate delays. The gate delay is proportional to Vdd / (Vdd - VT)^2, where VT is the threshold voltage [1]. Chandrakasan et al. [1] have shown that the strategy of operating at a fixed reduced voltage can be coupled with architectural-level parallelism and pipelining to compensate for the lower clock rate due to voltage reduction, so that the overall throughput remains the same while the overall power is still lowered, although at the cost of increased latency.
`
`
`
`
Microprocessor core          Clock (MHz)   MIPS   Technology (µm)   Area (mm²)   Power diss. (mW) (Volt.)
StrongARM                    233           266    0.35              4.3          300 (1.65)
ARM, 7                       40            36     0.6               5.9          200 (5)
ARM, 7 Low-Power             27            24     0.6               3.8          45 (3.3)
LSI Logic, TR4101            81            30     0.35              2            81 (3.3)
LSI Logic, CW4001            60            53     0.5               3.5          120 (3.3)
LSI Logic, CW4011            80            120    0.5               7            280 (3.3)
DSP Group, Oak               80            80     0.6               8.4          190 (5)
NEC, R4100                   40            40     0.35              5.4          120 (3.3)
Toshiba, R3900               50            50     0.6               15           400 (3.3)
Motorola, 68000              33            16     0.5               4.4          35 (3.3)
PowerPC, 403                 33            41     0.5               7.5          40 (3.3)
ARM 710 (ARM7 / 8KB)         40            72     0.6               34           424 (5)
SA-110 (StrongARM / 32KB)    200           230    0.35              50           900 (1.65)

Table 1. The performance, area, and power data for a subset of processor cores.
`
`Several researchers have addressed the issue of power in
`event-driven systems, and proposed various techniques for
`shutting down the system or parts of the system [25, 12,
`4]. Compilation techniques for low power software have
`emerged for both general-purpose computation [28] and DSP
`computations [11]. Numerous behavioral synthesis research
`efforts have also addressed power minimization [23].
`Four research groups have addressed the use of multiple
`(in their software implementation restricted to two or three)
`different voltages [2, 13, 18, 24]. They used the term “vari-
`able voltage” for a fixed number of simultaneously available
`voltages.
`
`3 Preliminaries
`
`3.1 Task model
`
A set T of independent tasks is to be executed on a system-on-chip. Each task Ti in T is associated with the following parameters:
- ai, its arrival time
- di, its deadline
- pi, its period
- Wi, its required number of CPU cycles
We assume, without loss of generality, that all tasks have identical periods. When this is not the case, a simple preprocessing step and application of the least common multiple (LCM) theorem [17] transforms an arbitrary set of periods to this design scenario in polynomial time. As a consequence, when the LCM theorem is applied, there may be a need to run several iterations of a given task within this overall period.
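The preprocessing step above can be sketched as follows. This is a minimal illustration; the task representation and field names are assumptions, not the paper's implementation:

```python
# Fold tasks with different periods into one hyperperiod (the LCM of all
# periods), replicating each task once per period instance inside it.
from math import lcm

def unroll_to_hyperperiod(tasks):
    """tasks: list of dicts with 'arrival', 'deadline', 'period', 'cycles'."""
    hyper = lcm(*(t["period"] for t in tasks))
    unrolled = []
    for t in tasks:
        for k in range(hyper // t["period"]):
            off = k * t["period"]  # shift the k-th instance by k periods
            unrolled.append({"arrival": t["arrival"] + off,
                             "deadline": t["deadline"] + off,
                             "period": hyper,
                             "cycles": t["cycles"]})
    return hyper, unrolled

tasks = [{"arrival": 0, "deadline": 4, "period": 4, "cycles": 100},
         {"arrival": 0, "deadline": 6, "period": 6, "cycles": 150}]
hyper, unrolled = unroll_to_hyperperiod(tasks)
print(hyper, len(unrolled))  # → 12 5
```

After unrolling, every task instance shares the hyperperiod of 12, with 3 instances of the first task and 2 of the second.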
`
`3.2 Variable voltage hardware model
`
`The variable voltage generated by the DC-DC switch-
`ing regulators in the power supply cannot instantaneously
`make a transition from one voltage to another. For exam-
`ple, [21] reported transition times of 6 msec/volt for a DC-
`DC switching converter. In personal communications with
`
`the authors, researchers at MIT and Berkeley have reported
`transitions times of 10 to 100 microsec/volt.
Dual to the supply voltage variation is the accompanying variation of the clock frequency. The time overhead associated with voltage switching is significant: on the order of 100s to 10,000s of instructions on a modern microprocessor. Fortunately, the computation itself can continue during the voltage and frequency change period. However, the speed overhead during the transition period must be explicitly accounted for.
`As suggested in [21], we employ a linear model to de-
`scribe the speed overhead. The maximum rate of voltage
`change is specified for the power supply regulator, and we
`can make a transition from a voltage to any other voltages
`within the maximum rate.
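Under this linear model, a lower bound on the transition time between any two voltages follows directly from the maximum rate of change. A minimal sketch, where the rate value is illustrative (taken from the optimistic end of the range quoted above):

```python
# With a maximum rate of change K (volts per second, a regulator parameter),
# moving between voltages v1 and v2 takes at least |v2 - v1| / K seconds.
def min_transition_time(v1, v2, max_rate_v_per_s):
    return abs(v2 - v1) / max_rate_v_per_s

# e.g. a 1.0 V swing at 1 volt per 10 microseconds:
t = min_transition_time(2.3, 3.3, 1.0 / 10e-6)
print(t)  # ≈ 1e-05 s (10 microseconds)
```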
`
`3.3 Power model
`
It is well known that there are three principal components of power consumption in CMOS integrated circuits: switching power, short-circuit power, and leakage power. The switching power, which dominates power consumption, is given by P = CL * Vdd^2 * fclock, as indicated earlier, where CL is the effective switched capacitance. It is also known that reduced voltage operation comes at the cost of reduced throughput [1]. The gate delay T follows the formula T = k * Vdd / (Vdd - VT)^2, where k is a constant [1] and VT is the threshold voltage. From these equations, together with the observation that the speed is proportional to fclock and inversely proportional to the gate delay, the power vs. speed curve can be derived. In particular, the normalized power Pn (= 1 at Vdd = Vref) can be expressed as a function of the normalized speed Sn (= 1 at Vdd = Vref); for example, the curve can be evaluated for Vref = 3.3 volts and VT = 0.8 volts. By varying Vdd, the system can be made to operate at different points along this curve. From these equations, it is easy to see that the power is a convex function of speed.
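The curve can be evaluated numerically from the two relations above (power proportional to Vdd^2 * fclock, and clock frequency inversely proportional to the gate delay). The sketch below assumes the delay form k * Vdd / (Vdd - VT)^2 and normalizes at Vref = 3.3 V, VT = 0.8 V; it is an illustration, not the paper's fitted polynomial:

```python
# Derive normalized (speed, power) points on the power vs. speed curve.
VREF, VT = 3.3, 0.8

def speed(vdd):
    # clock frequency ∝ 1 / gate delay, with delay ∝ Vdd / (Vdd - VT)^2
    return (vdd - VT) ** 2 / vdd

def normalized_point(vdd):
    sn = speed(vdd) / speed(VREF)   # normalized speed, = 1 at Vdd = Vref
    pn = (vdd / VREF) ** 2 * sn     # P ∝ Vdd^2 * f, so Pn = (Vdd/Vref)^2 * Sn
    return sn, pn

for vdd in (1.2, 2.0, 3.3):
    sn, pn = normalized_point(vdd)
    print(f"Vdd={vdd:.1f}V  Sn={sn:.3f}  Pn={pn:.3f}")
```

At reduced voltage the normalized power falls faster than the normalized speed, which is the quadratic benefit the text describes.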
`
`
`
`
`3.4 Target architecture
`
`Several factors combine to influence system performance:
`instruction and data cache miss rates and penalty, processor
`performance, and system clock speed. Power dissipation of
`the system is estimated using processor power dissipation
`per instruction and the number of executed instructions per
`task, supply voltage, energy required for cache read, write,
`and off-chip data access as well as the profiling information
`about the number of cache and off-chip accesses.
`Data on microprocessor cores have been extracted from
`manufacturer’s datasheets as well as from the CPU Center
`Info web site [3]. A sample of the collected data is pre-
sented in Table 1. The last two rows of the table show two
`integrated microprocessor products with on-chip caches and
`their system performance data. We assume that the proces-
`sor cores follow the power model described in the preceding
`subsection.
`
Cache size       Line size:
                 8B      16B     32B     64B     128B    256B    512B
512B             5.99    6.02    -       -       -       -       -
1KB              6.07    6.13    6.07    6.23    -       -       -
2KB              6.44    6.52    6.36    6.34    6.51    -       -
4KB              6.88    7.02    6.66    6.49    6.56    7.35    -
8KB              7.67    7.81    7.21    6.99    6.91    7.65    9.40
16KB             8.34    8.58    8.00    7.62    7.54    8.14    9.80
32KB             9.30    9.45    8.91    8.59    8.69    9.30    10.04

Table 2. A subset of the cache latency model: minimal cycle time (ns) for various direct-mapped caches with variable line sizes.
We use CACTI [30] as a cache delay estimation tool with respect to the main cache design choices: size, associativity, and line size. The energy model has been adopted from [5] and [27]. The overall cache model has been scaled with respect to actual industrial implementations. Caches typically found in current embedded systems range from 128B to 32KB. Although larger caches correspond to higher hit rates, their power consumption is proportionally higher, resulting in an interesting design trade-off. Higher cache associativity results in significantly higher access time. We use a recently developed compiler strategy for direct-mapped caches which efficiently minimizes the number of cache conflicts [14]. We have experimented with 2-way set-associative caches, which did not dominate comparable direct-mapped caches in a single case. Cache line size was also variable in our experiments. Its variation corresponds to the following trade-off: a larger line size results in a higher cache miss penalty and higher power consumption by the sense amplifiers and column decoders, while a smaller line size results in large cache decoder power consumption. Extreme values result in significantly increased access time. We estimated the cache miss penalty based on the operating frequency of the system and the external bus width and clock for each system investigated. This penalty ranged between 4 and 20 system clock cycles. Write-back was adopted as opposed to write-through, since it is proven
`
                 No optimizations           Block buffering, sub-banking,
                                            and Gray code addressing [27]
Cache size       8B      16B     32B        8B      16B     32B
512B             0.330   0.378   0.468      0.291   0.322   0.401
1KB              0.356   0.394   0.463      0.295   0.299   0.345
2KB              0.422   0.444   0.489      0.317   0.271   0.271
4KB              0.651   0.666   0.694      0.456   0.334   0.273
8KB              1.146   1.156   1.175      0.769   0.504   0.347
16KB             2.158   2.164   2.174      1.412   0.869   0.530
32KB             4.198   4.202   4.209      2.702   1.608   0.922

Table 3. A subset of the cache power consumption model: power consumption (nJ) estimation for various direct-mapped caches with variable line sizes.
`
to provide superior performance and especially power savings in uniprocessor systems at increased hardware cost [15]. Each of the processors considered is able to issue at most a single instruction per clock period. Thus, caches were designed to have a single access port. A subset of the cache model data is given in Tables 2 and 3. The cache access delay and power consumption models were computed for a number of organizations and sizes, assuming a feature size of 0.5 µm and a typical six-transistor CMOS SRAM cell implementation. The nominal energy consumption per single off-chip memory access (in nJ) is adopted from [6].
`
`4 Theory of variable voltage scheduling: spe-
`cial solvable cases
`
`Yao, Demers and Shenker [31] have provided the opti-
`mal preemptive static scheduling algorithm for a set of in-
`dependent tasks with arbitrary arrival times and deadlines.
`The algorithm is based on the concept of critical regions
`and the tasks in critical regions are scheduled by the ear-
`liest deadline first (EDF) algorithm. The running time of
the algorithm is O(n log n) for n tasks. When there is a limit on the maximum voltage change rate, the problem becomes NP-complete, which is shown by a reduction from the SEQUENCING WITH DEADLINES AND SETUP TIMES problem on page 238 of [7]. In this section, we discuss some special cases which can be solved optimally.
We find the speed function S(t) of the processor that minimizes the energy consumption during the time period [t1, t2]. The voltage function V(t) can be easily obtained from S(t) [1]. We assume that the speed of the processor can be changed between Smin and Smax with a maximum change rate K, i.e., at any time t, the speed S(t) satisfies:

    Smin <= S(t) <= Smax    (1)
    |S'(t)| <= K            (2)

The amount of work that the processor completes during the time interval [t1, t2] is given by:

    W = ∫_{t1}^{t2} S(t) dt    (3)

The power, or energy consumed per unit time, is a convex function P(S) of the processor's speed. The function P(S) depends on the technology considered. The total energy consumed during the time interval [t1, t2] is:

    E = ∫_{t1}^{t2} P(S(t)) dt    (4)
`t
Lemma 1. Starting with speed S0, the work W that the processor can complete in the time interval [t1, t2] falls into the range [Wmin, Wmax], with

    Wmin = Sc(t2 - t1) + (S0 - Sc)^2 / (2K)
    Wmax = Sd(t2 - t1) - (Sd - S0)^2 / (2K)

where Sc = max(Smin, S0 - K(t2 - t1)) and Sd = min(Smax, S0 + K(t2 - t1)).
`
[Figure omitted: four speed profiles over [t1, t2].]
Figure 1. Illustration of Wmin and Wmax for Lemma 1: (a), (b) the two cases of Wmin, depending on whether S0 - K(t2 - t1) reaches Smin; (c), (d) the two cases of Wmax, depending on whether S0 + K(t2 - t1) reaches Smax.
`
Since power is a convex function of the processor's speed, it is obvious that if we can start from any speed, then to complete a given workload W in any time interval of fixed length t2 - t1, we should run the processor at the constant speed W / (t2 - t1). When we do not have the freedom to choose the starting speed, the following theorem shows that the best we can do is to change the speed as fast and as early as possible until we reach a critical point, and then keep the speed constant.
`
Theorem 1. With the same constraints as in Lemma 1, for any workload W in [Wmin, Wmax], there is a unique speed function S : [t1, t2] -> [Smin, Smax] such that (1), (2), and (3) are all satisfied and the energy (4) is minimized. Moreover, S(t) is defined as follows.
If Wmin <= W <= S0(t2 - t1), then
    S(t) = S0 - K(t - t1)    if t1 <= t <= tc;
    S(t) = S0 - K(tc - t1)   if tc < t <= t2,
where tc = t2 - sqrt((t2 - t1)^2 + (2/K)(W - S0(t2 - t1))).
If S0(t2 - t1) <= W <= Wmax, then
    S(t) = S0 + K(t - t1)    if t1 <= t <= tc;
    S(t) = S0 + K(tc - t1)   if tc < t <= t2,
where tc = t2 - sqrt((t2 - t1)^2 - (2/K)(W - S0(t2 - t1))).
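The ramp-down case of Theorem 1 (decelerate at rate K until a critical time tc, then hold constant) can be checked numerically. The sketch below chooses tc so that the total work over [t1, t2] equals W; all numbers are illustrative:

```python
from math import sqrt

def ramp_down_profile(S0, K, t1, t2, W):
    """Speed profile that ramps down from S0 at rate K until tc, then holds."""
    T = t2 - t1
    tc = t2 - sqrt(T * T + (2.0 / K) * (W - S0 * T))
    level = S0 - K * (tc - t1)  # constant speed after the critical point
    def S(t):
        return S0 - K * (t - t1) if t <= tc else level
    return S, tc

S, tc = ramp_down_profile(S0=10.0, K=2.0, t1=0.0, t2=4.0, W=30.0)

# integrate numerically (midpoint rule) to confirm the completed work is W
n = 100000
work = sum(S((i + 0.5) * 4.0 / n) for i in range(n)) * (4.0 / n)
print(round(work, 3))  # → 30.0
```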
Next we consider the case when there is a finishing speed constraint. Similarly, we have the following results:
Lemma 2. Starting with speed S0, to reach the finishing speed S1 at the end of the time interval [t1, t2], the work W that the processor has to complete falls into the range [Wmin, Wmax], with

    Wmin = Sc(t2 - t1) + (S0 - Sc)^2 / (2K) + (S1 - Sc)^2 / (2K)
    Wmax = Sd(t2 - t1) - (Sd - S0)^2 / (2K) - (Sd - S1)^2 / (2K)

where Sc = max(Smin, (S0 + S1)/2 - K(t2 - t1)/2) and
Sd = min(Smax, (S0 + S1)/2 + K(t2 - t1)/2).
`
Theorem 2. With the same constraints as in Lemma 2, for any workload W in [Wmin, Wmax], there is a unique speed function S : [t1, t2] -> [Smin, Smax] such that (1), (2), and (3) are satisfied and the energy (4) is minimized. The speed function is piecewise linear, with slope -K, 0, or +K on each segment, and satisfies the following:
if Wmin <= W <= W1,
    S(t) = S0 - K(t - t1)    if t1 <= t <= x1;
    S(t) = S0 - K(x1 - t1)   if x1 < t <= x2;
    S(t) = S1 - K(t2 - t)    if x2 < t <= t2.
if W1 <= W <= W2,
    S(t) = S0 + K(t - t1)    if t1 <= t <= x1;
    S(t) = S0 + K(x1 - t1)   if x1 < t <= x2;
    S(t) = S1 - K(t2 - t)    if x2 < t <= t2.
and if W2 <= W <= Wmax,
    S(t) = S0 + K(t - t1)    if t1 <= t <= x1;
    S(t) = S0 + K(x1 - t1)   if x1 < t <= x2;
    S(t) = S1 + K(t2 - t)    if x2 < t <= t2.
where W1, W2, x1, and x2 are constants.
`
From Theorem 2, it is simple to see that the speed functions which minimize energy have restricted shapes. For example, if S1 <= S0, then S(t) has to be one of the seven shapes illustrated in Figure 2.

[Figure omitted: the seven candidate speed profiles over [t1, t2].]
Figure 2. All possible shapes of an optimal speed function when S1 <= S0.
`
We consider the following scheduling problem:
Ordered Scheduling Problem (OSP): Given a partition 0 = t0 < t1 < t2 < ... < t_{n-1} < t_n = t of the time interval [0, t], with a workload Wi to be completed during each interval [t_{i-1}, t_i], we want to find the speed function S : [0, t] -> [Smin, Smax] such that:
(i) S(0) = S0,
(ii) Smin <= S(t) <= Smax,
(iii) |S'(t)| <= K,
(iv) Wi = ∫_{t_{i-1}}^{t_i} S(t) dt, and
(v) E = ∫_0^t P(S(t)) dt is minimized.
For this problem, we can use the following approach. Let S(t) be the speed function we are looking for, and let xi = S(ti) for i = 1, 2, ..., n. Using Theorem 2,
`
`
`
`
we can express S(t) as a function of the unknown variables x1, x2, ..., xn. We can then plug this expression into (v) and obtain a system of (non-linear) equations involving the xi's from the first-order condition for minimizing the energy function (v). The system of (non-linear) equations is solved by using some numerical method if a closed-form solution cannot be found. The speed function S(t) is determined by the formulae in Theorem 2 and the solution(s) of the system of (non-linear) equations.
There are a few comments to be made about this approach. First, this approach is impractical since the (non-linear) system of equations is very difficult to solve in many cases. Secondly, when applying Theorem 2 in this approach, we have to check another condition, Wmin(x_{i-1}, x_i) <= Wi <= Wmax(x_{i-1}, x_i), which makes the problem even more difficult. Thirdly, since the solution(s) to the (non-linear) system of equations are only candidates for our minimization problem, we have to apply (at least) the second-order condition to check them. Finally, there are still some more mathematical details that we have to take into consideration, e.g., the continuity and differentiability of the power function. Therefore, good heuristics will be more useful in real life.
We note that the following special case of the OSP problem can be solved optimally.
Restricted Ordered Scheduling Problem (ROSP): With the same constraints as in the OSP problem, S(ti) for i = 1, 2, ..., n are additionally given.
It is trivial to see that Theorem 2 provides an optimal speed function for the ROSP problem.
`
`5 Global design flow
`
In this section we describe the global flow of the proposed synthesis system, explain the function of each subtask, and show how these subtasks are combined into a synthesis system.
`
[Figure omitted: flow chart. The recoverable steps are:]
Search for the minimum power consumption configuration of processor, I-cache, and D-cache for all applications
Change configuration: I-cache, D-cache configuration
System performance and power evaluation via simulation platform
Performance and power optimization heuristic for mapping basic blocks to cache lines
System performance and power consumption estimation
Variable voltage task scheduling

Figure 3. The global flow of the synthesis approach.
`
Figure 3 illustrates the synthesis system. The goal is to choose the configuration of processor, I-cache, and D-cache and the variable voltage task schedule with minimum power consumption which satisfy the timing requirements of multiple preemptive tasks. To accurately predict the system performance and power consumption for target applications, we employ an approach which integrates optimization, simulation, modeling, and profiling tools, as shown in Figure 5. The synthesis technique considers each non-dominated microprocessor core and competitive cache configuration, and selects the hardware setup which requires minimal power consumption and satisfies the individual performance requirements of all target applications. The application-driven search for a low-power core and cache system requires the usage of trace-driven cache simulation for each promising point considered in the design space. We attack this problem by carefully scanning the design space using search algorithms with sharp bounds and by providing powerful algorithmic performance and power estimation techniques. The search algorithm to find an energy-efficient system configuration is described using the pseudo-code shown in Figure 4.
`
Sort processor cores in a list L in increasing order of
    EnergyPerInstruction / SystemClockFrequency at the nominal voltage 5V
Delete the dominated processor cores from L
For each processor core in L in the order of appearance
    For I-cache = 512B..32KB and CacheLineSize = 8B..512B
        For D-cache = 512B..32KB and CacheLineSize = 8B..512B
            Check bounds; if exceeded, break
            If (current I- and D-cache configuration has never been evaluated)
                Evaluate performance and power consumption
            Using the cache system analysis, evaluate the power consumption
                of the entire system using variable voltage task scheduling
            Memorize configuration C if power consumption is minimal

Figure 4. Pseudo-code for the search for a power-efficient system configuration.
Since performance and power evaluation of a single processor, I-cache, and D-cache configuration requires a time-consuming trace-driven simulation, the goal of our search technique is to reduce the number of evaluated cache systems using sharp bounds for cache system performance and power estimations. However, a particular cache system is evaluated using trace-driven simulation only once, since the data retrieved from such simulation can be used for overall system power consumption estimation for different embedded processor cores with minor additional computational expense.
The algorithm excludes from further consideration processor cores dominated by other processor cores. One processor type dominates another if it consumes less power at a higher frequency and results in higher MIPS performance at the same nominal power supply. The competitive processors are then sorted in ascending order with respect to their power consumption per instruction and frequency ratio. Microprocessors which seem to be more power-efficient are, therefore, given priority in the search process. This step later provides sharper bounds for search termination. The search for the most efficient cache configuration is bounded
`
`
`
`
with sharp bounds. A bound is determined by measuring the number of conflict misses and comparing the energy required to fetch the data from off-chip memory due to the measured conflict misses against the power that would have been consumed by a cache twice as large for the same number of cache accesses, assuming zero cache conflicts. We terminate further increase of the cache structure when the power consumption due to the larger cache would be larger than the energy consumed by the current best solution. Similarly, another bound is defined at the point when the energy required to fetch the data from off-chip memory due to conflict cache misses for a cache half the size, under the assumption of zero energy consumption per cache access, is larger than the energy required for both fetching data from the cache and from off-chip memory in the case of the current cache structure. We abort further decrease of the cache structure if the amount of energy required to bring the data from off-chip memory due to the additional cache misses is larger than the energy consumed by the current best solution.
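The upper pruning bound described above can be sketched as a simple predicate; the function name and parameter values below are illustrative assumptions, not the implementation:

```python
# Stop growing the cache when even an ideal (conflict-free) cache of twice
# the size would consume more energy on its accesses than the current best
# configuration consumes in total.
def should_stop_growing(accesses, energy_per_access_2x, best_energy):
    ideal_2x_energy = accesses * energy_per_access_2x  # zero conflict misses
    return ideal_2x_energy > best_energy

# 1M accesses at 0.8 nJ each exceed a 0.5 mJ best solution, so prune:
print(should_stop_growing(1_000_000, 0.8e-9, 0.5e-3))  # → True
```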
When evaluating competitive hardware configurations, the target applications are scheduled with the variable voltage task scheduler, using the predicted system performance and power of the configuration considered.
`
Control Flow Graph Partitioning
`
on SHADE and DINEROIII [14, 15]. SHADE is a tracing tool which allows users to define custom trace analyzers and thus collect rich information on runtime events. SHADE currently profiles only SPARC binaries. The executable binary program is dynamically translated into host machine code. The tool also provides a stream of data to the translated code, which is directly executed to simulate and trace the original application code. A custom analyzer, composed of approximately 2,500 lines of C code, is linked to SHADE to control and analyze the generated trace information. The analyzer sources relevant trace information from SHADE and builds a control flow graph (CFG) corresponding to the dynamically executed code. The analysis consists of two passes. The first pass determines the boundaries of basic blocks, while the second pass constructs a CFG by adding control flow information between basic blocks. We also collect the frequencies of control transfers through ea