Task Mapping

Dynamic Task Mapping for MPSoCs

Ewerson Luiz de Souza Carvalho, Ney Laert Vilar Calazans, and Fernando Gehm Moraes
Pontifícia Universidade Católica do Rio Grande do Sul
Multiprocessor-system-on-a-chip (MPSoC) applications can consist of a varying number of simultaneous tasks and can change even after system design, enforcing a scenario that requires the use of dynamic task mapping. This article investigates dynamic task-mapping heuristics targeting reduction of network congestion in network-on-chip (NoC)-based MPSoCs. The proposed heuristics achieve up to 31% smaller channel load and up to 22% smaller packet latency than other heuristics.

0740-7475/10/$26.00 © 2010 IEEE. Copublished by the IEEE CS and the IEEE CASS. IEEE Design & Test of Computers, September/October 2010.

The evolution of deep-submicron technology has dramatically increased IC density, enabling the development of multiprocessor systems on a chip (MPSoCs), SoCs containing multiple processing elements (PEs). A homogeneous MPSoC contains a set of identical PEs, typically programmable processors. Whereas homogeneous MPSoCs simplify task migration, heterogeneous MPSoCs support a wider variety of applications because they integrate distinct PEs. Recently, Intel developed a homogeneous MPSoC with 80 PEs, following a trend of growing MPSoC complexity.

Applications (such as multimedia and networking) running in MPSoCs generally present a dynamic task workload. This implies a varying number of tasks running at any given moment, possibly exceeding the available hardware resources. Thus, it is necessary to control task operation and system resource use, including dynamic management of task-loading actions.

Task mapping, which consists of finding a placement for a set of tasks that meets a specific requirement (such as energy consumption savings), is an important issue. Mapping decisions can drastically influence system performance. If we use the moment at which a task is defined as a classification criterion, task-mapping approaches can be either static or dynamic. Static mapping defines task placement at design time; dynamic mapping defines task placement at runtime.

In this article, we investigate the performance of mapping algorithms for heterogeneous MPSoCs targeting dynamic workloads. Our main objective is to minimize congestion inside the network on chip (NoC) through channel usage optimization. The performance parameters we look at include execution time, NoC channel load, and packet latency. We also compare static and dynamic mapping approaches in terms of their advantages and drawbacks in application scenarios that we simulated in two sets of experiments.

Static mapping techniques
Static mapping techniques are suitable for specific platforms that don't need the insertion of new applications into the system at runtime. Because the mapping algorithm's running time is not a main concern, static mapping can use thorough information about the system to make decisions. This information includes the application graph's topology, the volume and rate of intertask communication, and total system usage.

Research into static mapping includes the work of Lei and Kumar, who proposed a genetic mapping algorithm to optimize application execution time.1 In their work, graphs represent applications, and the target architecture is an NoC. Wu, Al-Hashimi, and Eles also investigated genetic mapping algorithms.2 By combining dynamic voltage scaling techniques with mapping, they achieved a 51% savings in energy consumption. Murali and De Micheli explored mappings for more than one application in NoC design, using the tabu search (TS) algorithm.3 Manolache, Eles, and Peng investigated task mapping in NoCs, trying to guarantee packet latency.4 For this purpose, both the task-mapping algorithm (TS) and the routing algorithm are defined at design time. Hu and Marculescu
presented a branch-and-bound algorithm to map a set of IP cores (IPs) into an NoC with bandwidth reservation.5 Their results show energy savings of 51.7% in the communication architecture. Marcon et al. investigated how to map modules into an NoC, targeting low energy consumption.6 They compared several algorithms, using a model that characterizes applications by their intertask communication volume.
`
Figure 1. Conceptual MPSoC organization. The shaded blocks inside the MP are the modules addressed in this article. The Rs are NoC routers.
`
Dynamic mapping techniques
In dynamic mapping, the time taken to map each task is crucial, since it influences overall application execution time. To reduce mapping overhead, it is important to use greedy algorithms, because they can trade search-space exploration quality for fast results. Although execution time for mapping is an important cost function, several works don't consider this overhead.7,8 These researchers assumed that overall system execution time compensates for the time spent mapping each task. On this assumption, many researchers used the same strategies in dynamic scenarios as those used in static scenarios. For example, Murali and De Micheli used simulated annealing (SA) algorithms for static mapping, and Ngouanga et al. used SA for dynamic mapping.3,7

Smit, Hurink, and Smit presented an iterative hierarchical approach to mapping an application to an NoC-based SoC at runtime.8 A set of communicating tasks models the application. Aiming at energy consumption savings, the mapping algorithm tries to place each task near its communicating PEs. Similarly, besides the SA algorithm, Ngouanga et al. proposed a force-directed mapping algorithm, which aims to keep communicating tasks close to each other.7 Mehran, Khademzadeh, and Saeidi presented a mapping algorithm that searches for a placement following a spiral path, tending to place communicating tasks together.9 This method executes for a single application, without cost function evaluation. Chou, Ogras, and Marculescu combined task-mapping and multiple-voltage-level schemes, allowing reduction of communication overhead and energy.10 They achieved a communication energy savings of up to 50%.

Al Faruque, Krist, and Henkel suggested a distributed agent-based mapping approach for larger MPSoCs, applying it to a 32 × 64 MPSoC.11 The authors claim to achieve on average a 7.1 times lower computational effort for the mapping algorithm than with a simple nearest-neighbor heuristic.
`
Target MPSoC organization
A heterogeneous MPSoC is a set of different PEs interacting through a communication network. In our proposed NoC-based heterogeneous MPSoC model, each PE supports the execution of one task at a time. PEs can support either hardware or software task execution. Software tasks execute in instruction set processors (ISPs), and hardware tasks in reconfigurable logic (RL) or dedicated IPs.

Figure 1 illustrates the proposed heterogeneous MPSoC organization. Among the available resources, one processor, the manager processor (MP), is responsible for resource control, configuration control, and four task manipulation issues: scheduling, binding, mapping, and migration. The MP fires each application's initial task. When a running task needs to communicate with another task not yet mapped, the latter is loaded into a PE from task memory.

A directed graph, in which vertices represent software or hardware tasks and edges define communicating PE pairs, models each application. Each edge defines a master-slave communication: the edge's source vertex is the master of the communication, and the other is the slave. Each edge receives four parameters, which represent the volume and rate of data sent from master to slave and from slave to master. Similar models are common in several mapping algorithms.5,6,10 However, unlike our algorithm, these
`
algorithms consider only the communication volume for the application characterization graph (ACG),5 ignoring transmission rate information.
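As a concrete illustration, the application model above can be sketched as a small data structure. This is a hypothetical rendering; the field names and example values are illustrative, not taken from the article.

```python
from dataclasses import dataclass, field

# Sketch of the application model: a directed graph whose vertices are tasks
# and whose edges carry four parameters, the volume and the rate of the data
# exchanged in each direction of a master-slave communication.
@dataclass
class Edge:
    master: str      # source vertex: master of the communication
    slave: str       # target vertex: slave of the communication
    vol_ms: int      # data volume, master -> slave
    rate_ms: float   # injection rate, master -> slave
    vol_sm: int      # data volume, slave -> master
    rate_sm: float   # injection rate, slave -> master

@dataclass
class AppGraph:
    tasks: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_comm(self, master, slave, vol_ms, rate_ms, vol_sm, rate_sm):
        self.tasks |= {master, slave}
        self.edges.append(Edge(master, slave, vol_ms, rate_ms, vol_sm, rate_sm))

g = AppGraph()
g.add_comm("t0", "t1", vol_ms=512, rate_ms=0.10, vol_sm=64, rate_sm=0.05)
```

Keeping both volume and rate per direction is what distinguishes this model from the ACG variants that record volume only.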
Intertask communications use messages transmitted through the network. Four message types comprise the adopted communication protocol. A request message from a running task to the MP contains the identification of a new task to be inserted in the system. A release message notifies the MP that a processing PE has finished its current task, freeing this PE for reuse by another task. After task mapping, the MP sends a notify message to both tasks involved in the communication that fired the mapping; this message contains the tasks' current addresses, enabling correct packet transmission. Intertask communications employ general messages, which carry the data some pairs of tasks need to exchange.
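The four message types can be sketched as a minimal enumeration; this assumes nothing about packet formats beyond the protocol description above, and the payload summaries in the comments are paraphrases of the text.

```python
from enum import Enum, auto

# The four message types of the adopted protocol; payloads are summarized
# in comments only, since the article does not define packet formats.
class MsgType(Enum):
    REQUEST = auto()  # running task -> MP: identifies a new task to insert
    RELEASE = auto()  # PE -> MP: current task finished, PE free for reuse
    NOTIFY  = auto()  # MP -> both tasks: tasks' current addresses after mapping
    GENERAL = auto()  # task <-> task: application data exchange
```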
`
Dynamic mapping algorithms
The task-mapping problem is directly reducible to the quadratic assignment problem, a well-known NP-hard problem. Using heuristics to solve this problem at runtime is thus mandatory for even moderate-sized NoCs. The main cost function of the four dynamic mapping heuristics we propose here is the reduction of NoC channel congestion.

Each application has an initial task, which is the only one initially mapped when the application starts. Remaining application tasks are mapped on demand, according to communication requests. The mapping of initial application tasks uses a clustering strategy that partitions the MPSoC into virtual regions called clusters, each initially assigned to a single application. Only one initial task can be mapped in each cluster, thus reducing the probability of tasks that belong to different applications sharing the same NoC region. This also reduces the sharing of NoC channels by communications of distinct applications, in contrast to several other mapping strategies.1,5,11 The clustering step is executed at system initialization, when specific PEs are statically defined to map initial tasks for a set of applications. If the number of applications exceeds the available clusters, some applications are scheduled to be mapped when resources become available.
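The clustering step can be sketched as follows. The rectangular cluster shape and the choice of a cluster's corner PE for the initial task are assumptions of this sketch; the article only states that clusters and initial-task PEs are defined statically at initialization.

```python
# Partition a noc_w x noc_h mesh into rectangular cl_w x cl_h clusters and
# reserve one PE per cluster for an application's initial task.
# Rectangular clusters and the corner PE are assumptions of this sketch.
def make_clusters(noc_w, noc_h, cl_w, cl_h):
    clusters = []
    for y0 in range(0, noc_h, cl_h):
        for x0 in range(0, noc_w, cl_w):
            pes = [(x, y)
                   for x in range(x0, min(x0 + cl_w, noc_w))
                   for y in range(y0, min(y0 + cl_h, noc_h))]
            clusters.append({"pes": pes, "initial_pe": pes[0], "app": None})
    return clusters

def map_initial_task(clusters, app_id):
    # At most one initial task per cluster; if every cluster is taken, the
    # application is scheduled to be mapped when resources become available.
    for cluster in clusters:
        if cluster["app"] is None:
            cluster["app"] = app_id
            return cluster["initial_pe"]
    return None

clusters = make_clusters(8, 8, 4, 4)  # an 8 x 8 MPSoC split into four clusters
```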
Mapping all tasks at once might provide better task-mapping solutions by making it possible to consider global information about resource utilization and the entire application graph structure. However, in a dynamic workload scenario, it is not possible to define when each task will be needed.
`
We use two reference mapping methods: first free (FF) and nearest neighbor (NN). FF starts at resource R00, walking the network column by column, from the bottom up. FF selects the first free resource according to binding definitions, without considering other metrics. FF might generate the worst results compared with the other heuristics presented here; however, we would expect even worse solutions if we used random mappings as a reference.10
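FF can be sketched in a few lines; the dictionaries holding PE state are placeholders for the MP's PE occupation matrix.

```python
# First free (FF): starting at R00, walk the mesh column by column, bottom
# up, and return the first free PE whose type matches the task binding.
# `free` and `pe_type` stand in for the MP's PE occupation matrix.
def first_free(noc_w, noc_h, free, pe_type, task_type):
    for x in range(noc_w):        # column by column, starting at column 0
        for y in range(noc_h):    # bottom up within each column
            if free[(x, y)] and pe_type[(x, y)] == task_type:
                return (x, y)
    return None                   # no compatible free PE: schedule the task
```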
Similar to FF, the NN method considers no metric other than the proximity of an available resource that can execute the required task. NN starts searching for a free node to execute the task near the requesting task. The search tests all n-hop neighbors, with n varying spirally between 1 and the NoC limits, stopping when it finds the first resource that can execute the task.
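NN can be sketched by testing rings of increasing hop distance. Enumerating each ring by scanning the whole mesh is a simplification of the spiral walk, chosen here for brevity; the result (first free, type-compatible PE at the smallest distance) is the same.

```python
# Nearest neighbor (NN): test all n-hop neighbors of the requesting task's
# PE, with n growing from 1 to the NoC limits, and stop at the first free,
# type-compatible PE found.
def nearest_neighbor(src, noc_w, noc_h, free, pe_type, task_type):
    sx, sy = src
    max_dist = noc_w + noc_h - 2          # largest hop distance in the mesh
    for dist in range(1, max_dist + 1):
        for x in range(noc_w):
            for y in range(noc_h):
                if (abs(x - sx) + abs(y - sy) == dist
                        and free[(x, y)] and pe_type[(x, y)] == task_type):
                    return (x, y)
    return None
```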
The first heuristic proposed here is minimum maximum channel load (MMCL) mapping. MMCL evaluates all possible mappings for each new task inserted in the system. Its goal is to globally minimize channel usage peaks, reducing the occurrence of hot spots.

The second heuristic, minimum average channel load (MACL), aims at reducing the average usage of the NoC channels. MACL is similar to MMCL, replacing the maximum with the average function. While the MMCL heuristic tries to minimize channel usage peaks, MACL tries to homogeneously distribute the communication load in the NoC. The selected mapping is the one resulting in the lowest average channel usage.
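The two cost functions can be sketched as selection criteria over candidate PEs. Here `loads_after(pe)`, returning the per-channel load vector that would result from mapping the new task on `pe`, is a placeholder for the MP's channel occupation matrices, not a function defined in the article.

```python
# MMCL: keep the candidate whose resulting peak channel load is smallest.
def mmcl(candidates, loads_after):
    return min(candidates, key=lambda pe: max(loads_after(pe)))

# MACL: keep the candidate whose resulting average channel load is smallest.
def macl(candidates, loads_after):
    return min(candidates,
               key=lambda pe: sum(loads_after(pe)) / len(loads_after(pe)))

# Toy example: candidate "B" lowers the peak load, candidate "A" the average.
loads = {"A": [3.0, 0.0], "B": [2.0, 2.0]}
```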
The MMCL and MACL heuristics consider all NoC channels while mapping a new task. Since this evaluation can take a long time, the third heuristic, the path load (PL) algorithm, considers only the path segments used by the task under mapping. A path segment groups a set of network channels, defined as a function of the task placement and the routing algorithm. Given two tasks, A and B, a communication path between these tasks is represented by CP(A, B). This path is composed of four segments individually taken from east, west, south, and north channels. A path segment can be empty; for example, if the master-slave pair maps to the same column or row of PEs. PL computes the cost of each mapping k according to its respective path segments. For this purpose, PL adds the rates of the new task to the current rates in each channel in CP. The selected mapping is the one with minimum cost.
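The PL cost of one candidate placement can be sketched as follows, assuming XY routing (which the evaluation platform described later uses) to enumerate CP(A, B). Representing a channel as a directed (from_node, to_node) pair is an assumption of this sketch.

```python
# Channels of the path from a to b under XY routing (first travel along X,
# then along Y); each channel is a directed (from_node, to_node) pair.
def xy_path(a, b):
    path, (x, y) = [], a
    while x != b[0]:                        # east/west path segment
        nx = x + (1 if b[0] > x else -1)
        path.append(((x, y), (nx, y)))
        x = nx
    while y != b[1]:                        # north/south path segment
        ny = y + (1 if b[1] > y else -1)
        path.append(((x, y), (x, ny)))
        y = ny
    return path

# PL cost of mapping the new task on `candidate`: add the task's rate to the
# current rate of every channel in CP(master, candidate), both directions.
def path_load_cost(master, candidate, rate, channel_load):
    cost = 0.0
    for ch in xy_path(master, candidate) + xy_path(candidate, master):
        cost += channel_load.get(ch, 0.0) + rate
    return cost
```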
`
Inputs: A task ti to be mapped, the master task mi, the PE occupation matrix, and the NoC channel occupation matrices
Output: The best PE - best_pe

1.  dist ← 1                          // Initializes # of hops from mi (smallest)
2.  cost ← ∞                          // Initializes cost with highest value
3.  WHILE cost = ∞ DO
4.      // Get all mi neighbors within a distance dist
5.      pe_list ← neighbors(dist, mi)
6.      FOR ALL ELEMENTS pi IN pe_list
7.          // Tests PE occupation and type (i.e., task binding)
8.          IF state(pi) = free AND type(pi) = type(ti) THEN
9.              // Computes path load cost in NoC channels
10.             cost_pl ← 0
11.             channel_list ← routing_path(pi, mi)
12.             FOR ALL ci IN channel_list
13.                 cost_pl ← cost_pl + load(ci)
14.             END FOR
15.             // Set the minimum cost and the target PE
16.             IF cost_pl < cost THEN
17.                 cost ← cost_pl
18.                 best_pe ← pi
19.             END IF
20.         END IF
21.     END FOR
22.     dist ← dist + 1               // Go to next range of neighbors
23. END WHILE
24. return best_pe

Figure 2. Best neighbor (BN) mapping algorithm pseudocode.
`
PL evaluates all mapping possibilities. The fourth heuristic, best neighbor (BN), combines the NN search strategy and the PL computation approach. The BN search method is similar to NN in its spiral searches from the source node. This avoids computing all feasible mapping solutions as in the PL heuristic, reducing the heuristic's execution time. BN selects the best neighbor, according to the PL equations, instead of the first free neighbor as in NN.

To illustrate the heuristics' behavior, Figure 2 depicts the BN algorithm. Before executing, BN or another mapping algorithm must verify the existence of free resources compatible with the task to be mapped. If such resources are not available, the task is scheduled for later execution.

The algorithm starts by testing the closest neighbors, those at dist = 1 hop (line 1). If no feasible mapping is found during this iteration, the distance is incremented, testing a new group of neighbors (line 22). For this purpose, in each WHILE loop iteration, the function neighbors(dist, mi) computes the list of all PEs placed at dist hops from task mi (line 5). Next, for all PEs in this list, it is necessary to test the resource occupation (the free or used state) and its compatibility (whether it is a hardware or a software task) with the task to be mapped (line 8). BN computes the cost only for feasible mappings.

The cost computation takes place at lines 11 to 14. To obtain the list of channels that comprise the communication path between mi and pi and back to mi, the algorithm invokes function routing_path(pi, mi) (line 11). This function returns a list of channels as dictated by the routing algorithm. The path cost (cost_pl) results from the accumulation of all channel loads in the list (lines 12 to 14). If the obtained cost is smaller than the previously computed cost, the cost is updated, and the same is true for the best PE (lines 16 to 19). Finally, when it finds at least one PE, the algorithm exits and returns the best PE.

The BN algorithm can produce either the PL or NN algorithm by using selected simplifications. First, eliminating lines 10 to 19 of BN and substituting them with line 24 produces the NN algorithm, which returns the first neighbor compatible with the test in line 8. To produce the PL algorithm from BN, the starting point is to suppress the WHILE loop (lines 3 to 5, 22, and 23). In addition, the list pe_list must contain all the MPSoC's PEs.
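The Figure 2 pseudocode can be rendered as executable code, with the same simplifying assumptions as the sketches above: XY routing supplies the channel list, and plain dictionaries stand in for the MP's occupation matrices. A mesh-wide bound on dist is added so the loop terminates when no compatible PE exists.

```python
# Best neighbor (BN): spiral search as in NN, but among the feasible PEs of
# the first ring that contains any, pick the one with the lowest path load.
def xy_path(a, b):
    # Channels of the XY route from a to b, as (from_node, to_node) pairs.
    path, (x, y) = [], a
    while x != b[0]:
        nx = x + (1 if b[0] > x else -1)
        path.append(((x, y), (nx, y)))
        x = nx
    while y != b[1]:
        ny = y + (1 if b[1] > y else -1)
        path.append(((x, y), (x, ny)))
        y = ny
    return path

def best_neighbor(mi, task_type, noc_w, noc_h, free, pe_type, load):
    best_pe, cost = None, float("inf")       # lines 1-2: dist = 1, cost = inf
    dist, max_dist = 1, noc_w + noc_h - 2    # bound added for termination
    while cost == float("inf") and dist <= max_dist:          # line 3
        ring = [(x, y) for x in range(noc_w) for y in range(noc_h)
                if abs(x - mi[0]) + abs(y - mi[1]) == dist]   # line 5
        for pi in ring:                                       # line 6
            if free[pi] and pe_type[pi] == task_type:         # line 8
                # lines 10-14: accumulate loads on the round-trip path
                cost_pl = sum(load.get(ch, 0.0)
                              for ch in xy_path(pi, mi) + xy_path(mi, pi))
                if cost_pl < cost:                            # lines 16-19
                    cost, best_pe = cost_pl, pi
        dist += 1                                             # line 22
    return best_pe                                            # line 24
```

Note that the WHILE condition makes the search stop after the first ring that yields a finite cost, so BN picks the best PE within the nearest feasible ring rather than globally.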
`
Dynamic mapping evaluation
To evaluate the proposed dynamic mapping heuristics, we used a simulation environment that employs VHDL RTL code for the NoC description.12 The NoC is a 2D, wormhole, packet-switched mesh architecture that uses XY routing. PEs are described in terms of SystemC RTL threads. To hold system usage status, one thread describes the MP, containing scheduling queues, channel occupation matrices, and the PE occupation matrix. A generic and parameterizable thread describes each PE. When a given task is mapped to a given PE, this thread reads a configuration file that describes the task's communication rates and volumes, as well as its execution time. With parameterization, PEs can emulate several tasks during system simulation. This strategy allows the addition of new applications to the system, as already proposed by Chou, Ogras, and Marculescu.10

For our first set of experiments, we used three simulation scenarios:

- Scenario A (pipe): 20 identical pipeline-like applications (such as typical dataflow applications), each with 10 tasks (seven software and three hardware) and injection rates varying from 5% to 30% of available channel bandwidth.
- Scenario B (tree): 20 identical tree-like applications (such as typical parallel benchmarks), each with 10 tasks (eight software and two hardware) and injection rates varying from 5% to 20%.
- Scenario C (generic): 20 different generic applications produced with the TGFF (Task Graphs for Free) tool, each application consisting of 5 to 10 tasks, with varying numbers of hardware and software tasks and injection rates randomly chosen from 5% to 30%.

For these three scenarios, we adopted an 8 × 8 heterogeneous MPSoC. One node (the router with its PE) was the MP, 16 nodes were hardware resources, and 47 nodes were software resources. A set of PEs is reserved for initial tasks, to uniformly spread resources over the system area. The set of experiments arbitrarily assumed a maximum of 15 simultaneously running applications.

Figure 3 shows the various heuristics' average channel load (avg.), which represents NoC usage. All algorithms reduced average channel load in comparison with the FF method. MMCL and MACL reduced channel load by an average of 14% (with respect to FF), a result inferior to the 29.95% achieved with the NN algorithm, which did not consider traffic during mapping but explored the proximity of communicating tasks. The BN heuristic had a gain similar to NN's (29.74%), and PL displayed the best gain (31.36%).

Figure 3. Average channel load (% of available bandwidth) for scenarios A, B, and C at varying injection rates.

Channel load standard deviation (SD) measures traffic distribution inside the network. Lower values correspond to a homogeneous traffic distribution, while higher values suggest that some channels have higher loads and others are not used at all. For the simulated scenarios, BN reduced channel load SD by 20%, with values similar to the NN algorithm's. Again, PL presented the best results, with a gain of 22% over the FF reference mapping.

Table 1 shows the average packet latency of the proposed heuristics. Average packet latency depends on the distance between source and target PEs and on the communication path's congestion. The last line of the table shows each algorithm's percentage gain over FF mapping. In most cases, we obtained the best results with the PL heuristic. Additionally, PL, BN, and NN presented quite similar packet latency values, resulting in similar gains (about 15% with respect to FF).
`
Table 1. Average packet latency (in clock cycles). FF and NN are the reference mappings; MMCL, MACL, PL, and BN are the proposed heuristics.

Scenario (injection rate)   |  FF |  NN | MMCL | MACL |  PL |  BN
Apipe (5%)                  | 164 | 141 |  152 |  156 | 142 | 147
Apipe (10%)                 | 269 | 239 |  255 |  258 | 239 | 238
Apipe (15%)                 | 371 | 334 |  350 |  356 | 328 | 330
Apipe (20%)                 | 527 | 431 |  479 |  478 | 425 | 450
Apipe (25%)                 | 656 | 542 |  645 |  579 | 531 | 531
Apipe (30%)                 | 815 | 665 |  781 |  711 | 673 | 664
Btree (5%)                  | 156 | 143 |  147 |  153 | 146 | 146
Btree (10%)                 | 271 | 247 |  255 |  246 | 233 | 242
Btree (15%)                 | 370 | 327 |  349 |  352 | 329 | 330
Btree (20%)                 | 514 | 433 |  447 |  493 | 430 | 434
Cgeneric                    | 343 | 286 |  319 |  313 | 285 | 291
Gain with respect to FF (%) |     | 14.98 | 6.15 | 8.07 | 15.59 | 14.66
`
Table 2. Summary of results for all mapping algorithms, normalized to first free (FF) results. FF and NN are the reference mappings; MMCL, MACL, PL, and BN are the proposed heuristics.

Performance parameter               |   FF   |   NN   |  MMCL  |  MACL  |   PL   |   BN
Mapping complexity (x is NoC width) | O(x^2) | O(x^2) | O(x^4) | O(x^4) | O(x^3) | O(x^2)
Channel load (avg.)                 |  1.00  |  0.70  |  0.86  |  0.85  |  0.69  |  0.70
Channel load (SD)                   |  1.00  |  0.80  |  0.88  |  0.90  |  0.78  |  0.80
Packet latency (avg.)               |  1.00  |  0.85  |  0.94  |  0.92  |  0.84  |  0.85
Packet latency (SD)                 |  1.00  |  0.66  |  1.33  |  0.89  |  0.98  |  1.19
Execution time (vol.)               |  1.00  |  1.00  |  1.25  |  1.14  |  1.09  |  1.03
Execution time (10 × vol.)          |  1.00  |  0.98  |  1.00  |  1.00  |  0.99  |  0.99
`
Packet latency standard deviation is an important metric for systems with quality-of-service constraints. Packets that originate from a given source must keep a regular interval between them to respect the injection rate. Latency variation incurs jitter, which can lead to packet loss in applications with strict temporal deadlines, such as real-time applications. In scenario C, all proposed heuristics improved latency SD (that is, reduced deviation). However, MMCL and MACL presented 10% and 16% gains respectively, while BN presented a 14% gain over FF. NN and PL reduced packet latency SD by around 21%.

The last aspect we evaluated is execution time overhead. This parameter's goal is to measure the impact of implementing the mapping heuristics on overall execution time. The complexity of MMCL and MACL imposed a large penalty (25% and 13% respectively). PL and BN presented execution time overheads of 8.8% and 2.5% respectively. We considered this an acceptable cost, since these heuristics reduced channel load. Varying parameters in the performed experiments showed that a tenfold increase in communication volume can cancel this overhead. In that case, BN executed in 0.8% less time than FF, and PL in 1.13% less time than FF.

Table 2 summarizes the experimental results, normalized to the FF reference mapping. The PL heuristic achieved the best results in most performance figures: 31% smaller channel usage and 22% smaller packet latency. NN executed faster, due to its smaller
`
algorithmic complexity. Note that with increased transmission volume, mapping execution time did not affect total execution time.

Dynamic versus static mapping
Our second set of experiments compared static and dynamic mapping alternatives, and we have enumerated the advantages and drawbacks of each. The experiments investigated two static mapping algorithms originally proposed by Marcon et al.,6 simulated annealing (SA) and tabu search (TS), and the two dynamic algorithms that presented the best results in the experiments we described previously, PL and BN.

The set of graphs used in these experiments consisted of four synthetic application graphs (obtained with the TGFF tool), containing from seven to nine tasks, and the following four real application graphs:

- MPEG-4 decoder: 13 tasks;
- video object plane decoder (VOPD): 13 tasks;
- Romberg integration method (RBERG): 10 tasks;
- multiwindow display (MWD): 12 tasks.

A relevant feature of the MPEG-4 graph is that one of its 13 tasks is connected to eight other tasks. The VOPD application has less intertask dependency than the MPEG-4. The Romberg graph has significant intertask dependency, with most tasks communicating with four other tasks.

We evaluated two simulation scenarios:

- Scenario D (single-application mapping). This scenario employed a 5 × 4 homogeneous MPSoC. The evaluation goal was to compare mapping algorithms according to different performance figures for a single application. Thus, a static mapping could explore all MPSoC resources without the interference of other applications.
- Scenario E (multiple-application mapping). This scenario employed a 9 × 9 homogeneous MPSoC. The goal was to compare mapping algorithms for a set of simultaneously mapped applications.

Figure 4 presents the mapping of the real application graphs in scenario D for each algorithm. The figure shows the number of hops required to execute all bidirectional communication between all pairs of connected tasks for all mappings. (Total hops for MPEG-4, MWD, RBERG, and VOPD respectively: TS, 44, 34, 52, and 46; SA, 40, 26, 36, and 38; PL/BN, 50, 28, 42, and 40.) The SA algorithm obtained the smallest number of hops for all applications, which meant fewer channels were used to execute all communications. BN and PL, which had a local view of the application, achieved results near SA's and superior to TS's. The exception was the MPEG-4 application, in which task 1 had 8 neighbors. For BN and PL, task 1 was mapped near task 0 (its caller), and the remaining tasks connected to task 1 were mapped far from it because of a lack of available resources. Algorithms considering the whole application graph (SA and TS) mapped MPEG-4 task 1 in a more adequate position.

Figure 4. Mapping results for scenario D. Each square represents one PE. The number inside each square identifies the task allocated in that PE.

Table 3 presents the results obtained in scenario D, normalized to the SA algorithm.

Table 3. Summary results for scenario D. The values have been normalized with regard to the SA algorithm results. (Dynamic algorithms: PL, BN; static algorithms: SA, TS.)

Performance parameter     |  PL  |  BN  |  SA  |  TS
Channel occupation (avg.) | 1.07 | 1.06 | 1.00 | 0.96
Channel occupation (SD)   | 0.97 | 0.92 | 1.00 | 0.94
Packet latency (avg.)     | 1.01 | 1.01 | 1.00 | 1.01
Packet latency (SD)       | 0.99 | 0.99 | 1.00 | 1.00
Number of hops            | 1.14 | 1.14 | 1.00 | 1.26
Total execution time      | 1.04 | 1.03 | 1.00 | 1.01
Energy consumption        | 1.38 | 1.38 | 1.00 | 1.11
`
The channel occupation performance parameters (average and SD) represent NoC usage. The cost overhead of the dynamic algorithms was low, approximately 6% for the average NoC occupation. The cost in average packet transmission latency was even lower, 1%. The total number of hops increased 14% for the dynamic approaches.

Concerning total execution time, Table 3 shows that the values obtained with the BN and PL algorithms were on average only 4% worse than SA's result. Note that the time for mapping each task is considered when dynamic algorithms execute. Because static mapping algorithms execute at design time, there was no mapping overhead during application execution.

The last line of Table 3 presents average energy consumption, estimated according to Marcon et al.6 Dynamic mapping algorithms were penalized in this performance parameter by the additional network hops involved in the communication. MPEG-4 was the application that most penalized dynamic mapping algorithms. If this application were not included in the average, the energy consumption penalty would be smaller (17%). This comparison shows that dynamic mapping has smaller costs than static mapping when application graphs don't have strongly connected tasks. Situations like that of the MPEG-4 benchmark are not commonplace in actual applications.

Mapping quality can be estimated as a function of application area fragmentation. Figure 5 shows the application area distribution for the BN mapping. BN and PL generated more-isolated applications. The smaller fragmentation observed in dynamic mapping came from the clustering strategy, which increased the size of contiguous blocks.

Table 4 presents the results for scenario E, normalized as a function of the TS algorithm (for larger benchmarks, TS outperforms SA). The overall overhead imposed by dynamic mapping in such a complex scenario was small, as shown in columns PL and BN: 7% and 13% in average channel occupation, 7% and 10% in average packet latency, 4% and 3% in total execution time, and 18% and 13% in communication energy consumption.
`
Figure 5. Application distribution for the BN algorithm in scenario E (An = applications; MP = manager processor). Each distinct shade of gray corresponds to a distinct application. The figure highlights a single task of application A0 placed apart from the rest of the application (fragmentation), a PE shared by A2 and A5 (reuse), application A7 mapped in a contiguous area, and free resources; the BN mapping totals 264 hops.
`
Table 4. Summary of results for scenario E, normalized with regard to the TS algorithm. (Dynamic algorithms: PL, BN; static algorithms: SA, TS.)

Performance parameter     |  PL  |  BN  |  SA  |  TS
Channel occupation (avg.) | 1.07 | 1.13 | 1.13 | 1.00
Channel occupation (SD)   | 1.18 | 1.27 | 1.36 | 1.00
Packet latency (avg.)     | 1.07 | 1.10 | 1.16 | 1.00
Packet latency (SD)       | 1.03 | 1.11 | 1.05 | 1.00
Number of hops            | 0.93 | 0.92 | 1.59 | 1.00
Total execution time      | 1.04 | 1.03 | 1.02 | 1.00
Energy consumption        | 1.18 | 1.13 | 1.35 | 1.00
`
MPSoCs containing several tens of processors have recently been launched as commercial products. To function effectively, these complex MPSoCs must incorporate distributed or centralized PE management. Such management includes, for example, frequency and voltage control, processor and network load monitoring, and task mapping. The research described here contributes to the last item, task mapping.
September/October 2010
According to our first set of experiments, the PL heuristic is the best solution. Compared to ad hoc mapping, the proposed algorithms allow up to 31% smaller channel load and up to 22% smaller packet latency. Additionally, the heuristics present an acceptable overhead cost. The local decisions taken by the algorithms make MMCL and MACL ineffective: when a new mapping does not reduce the average link load (or maximum load), these algorithms select the first available mapping option. NN and BN display some improvements, but with gains inferior to PL's.
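The degenerate behavior just described for MMCL and MACL can be sketched as follows; the candidate set and cost function are simplified assumptions, not the heuristics' exact formulation:

```python
# Sketch of an MMCL/MACL-style selection: each candidate mapping is scored
# by the average link load it would produce; when no candidate reduces the
# current average, the heuristic degenerates to picking the first available
# option -- the purely local decision that makes it ineffective.

def select_mapping(candidates, avg_load_if_mapped, current_avg_load):
    improving = [c for c in candidates
                 if avg_load_if_mapped(c) < current_avg_load]
    if improving:
        return min(improving, key=avg_load_if_mapped)
    return candidates[0]  # no improvement possible: first-fit fallback

# Example: no candidate lowers the current average load of 0.40,
# so the first available PE is taken regardless of its cost.
loads = {"pe3": 0.42, "pe7": 0.45, "pe9": 0.41}
print(select_mapping(list(loads), loads.__getitem__, 0.40))  # 'pe3'
```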
Our second set of experiments shows that the cost overhead of dynamic mapping compared to static mapping for a complex scenario was on average 10% in channel occupation, 8.5% in latency, 3.5% in total execution time, and 15.5% in communication energy consumption. This is an acceptable overhead, considering the advantages offered by dynamic mapping:
`
`Š Smaller systems can be used, since only tasks
`being executed need be mapped into the system.
`Š The number of tasks can be greater than available
`system resources.
`Š The inclusion of new applications after design
`extends the MPSoC’s usability.
`
Dynamic mapping's main weakness is that it takes a partial view of the application graph, considering only the intercommunication of the task being mapped and its caller task. On the other hand, static mapping considers all tasks and resources together, using algorithms that are more expensive in terms of execution time. In future research, we plan to take several directions to improve the performance of dynamic mapping. For example, when a given task starts its execution, the MP could start the mapping of all its slave tasks or reserve resources for them. A new heuristic could adopt a task migration strategy when the communication cost becomes too high for a given task. Finally
