Dynamic Task Mapping for MPSoCs

Ewerson Luiz de Souza Carvalho, Ney Laert Vilar Calazans, and Fernando Gehm Moraes
Pontifícia Universidade Católica do Rio Grande do Sul
`
Multiprocessor-system-on-a-chip (MPSoC) applications can consist of a varying number of simultaneous tasks and can change even after system design, enforcing a scenario that requires the use of dynamic task mapping. This article investigates dynamic task-mapping heuristics targeting reduction of network congestion in network-on-chip (NoC)-based MPSoCs. The proposed heuristics achieve up to 31% smaller channel load and up to 22% smaller packet latency than other heuristics.
`
In this article, we investigate the performance of mapping algorithms for heterogeneous MPSoCs targeting dynamic workloads. Our main objective is to minimize congestion inside the network on chip (NoC) through channel usage optimization. The performance parameters we look at include execution time, NoC channel load, and packet latency. We also compare static and dynamic mapping approaches in terms of their advantages and drawbacks in application scenarios that we simulated in two sets of experiments.
`
Static mapping techniques
Static mapping techniques are suitable for specific platforms that don't need the insertion of new applications into the system at runtime. Because the mapping algorithm's running time is not a main concern, static mapping can use thorough information about the system to make decisions. This information includes the application graph's topology, volume and rate of intertask communication, and total system usage.

Research into static mapping includes the work of Lei and Kumar, who proposed a genetic mapping algorithm to optimize application execution time.1 In their work, graphs represent applications and the target architecture is an NoC. Wu, Al-Hashimi, and Eles also investigated genetic mapping algorithms.2 By combining dynamic voltage scaling techniques with mapping, they achieved a 51% savings in energy consumption. Murali and De Micheli explored mappings for more than one application in NoC design, using the tabu search (TS) algorithm.3 Manolache, Eles, and Peng investigated task mapping in NoCs, trying to guarantee packet latency.4 For this purpose, both the task-mapping algorithm (TS) and the routing algorithm are defined at design time. Hu and Marculescu
THE EVOLUTION OF deep-submicron technology has dramatically increased IC density, enabling the development of multiprocessor systems on a chip (MPSoCs), SoCs containing multiple processing elements (PEs). A homogeneous MPSoC contains a set of identical PEs, typically programmable processors. Whereas homogeneous MPSoCs simplify task migration, heterogeneous MPSoCs support a wider variety of applications because they integrate distinct PEs. Recently, Intel developed a homogeneous MPSoC with 80 PEs, following a trend of growing MPSoC complexity.

Applications (such as multimedia and networking) running in MPSoCs generally present a dynamic task workload. This implies a varying number of tasks running at any given moment, possibly exceeding the available hardware resources. Thus, it is necessary to control task operation and system resource use, including dynamic management of task-loading actions.

Task mapping, which consists of finding a placement for a set of tasks that meets a specific requirement, such as energy consumption savings, is an important issue. Mapping decisions can drastically influence system performance. If we use the moment at which a task is defined as a classification criterion, task-mapping approaches can be either static or dynamic. Static mapping defines task placement at design time; dynamic mapping defines task placement at runtime.
`
0740-7475/10/$26.00 © 2010 IEEE
Copublished by the IEEE CS and the IEEE CASS
IEEE Design & Test of Computers, September/October 2010
`
presented a branch-and-bound algorithm to map a set of IP cores (IPs) into an NoC with bandwidth reservation.5 Their results show energy savings of 51.7% in the communication architecture. Marcon et al. investigated how to map modules into an NoC, targeting low energy consumption.6 They compared several algorithms, using a model that characterizes applications by their intertask communication volume.
`
Figure 1. Conceptual MPSoC organization. The shaded blocks inside the MP are the modules addressed in this article. The Rs are NoC routers.
`
Dynamic mapping techniques
In dynamic mapping, the time taken to map each task is crucial, since it influences overall application execution time. To reduce mapping overhead, it is important to use greedy algorithms, because they can trade search space exploration quality for fast results. Although execution time for mapping is an important cost function, several works don't consider this overhead.7,8 These researchers assumed that overall system execution time compensates for the time spent mapping each task. On this assumption, many researchers used the same strategies in dynamic scenarios as those used in static scenarios. For example, Murali and De Micheli used simulated annealing (SA) algorithms for static mapping, and Ngouanga et al. used SA for dynamic mapping.3,7

Smit, Hurink, and Smit presented an iterative hierarchical approach to mapping an application to an NoC-based SoC at runtime.8 A set of communicating tasks models the application. Aiming at energy consumption savings, the mapping algorithm tries to place each task near its communicating PEs. Similarly, besides the SA algorithm, Ngouanga et al. proposed a force-directed mapping algorithm, which aims to keep communicating tasks close to each other.7 Mehran, Khademzadeh, and Saeidi presented a mapping algorithm that searches for a placement following a spiral path, tending to place communicating tasks together.9 This method executes for a single application without cost function evaluation. Chou, Ogras, and Marculescu combined task-mapping and multiple-voltage-level schemes, allowing reduction of communication overhead and energy.10 They achieved a communication energy savings of up to 50%.

Al Faruque, Krist, and Henkel suggested a distributed agent-based mapping approach for larger MPSoCs, applying it to a 32 × 64 MPSoC.11 The authors claim to achieve on average a 7.1 times lower computational effort for the mapping algorithm than with a simple nearest-neighbor heuristic.
`
Target MPSoC organization
A heterogeneous MPSoC is a set of different PEs interacting through a communication network. In our proposed NoC-based heterogeneous MPSoC model, each PE supports the execution of one task at a time. PEs can support either hardware or software task execution. Software tasks execute in instruction set processors (ISPs), and hardware tasks in reconfigurable logic (RL) or dedicated IPs.

Figure 1 illustrates the proposed heterogeneous MPSoC organization. Among the available resources, one processor, the manager processor (MP), is responsible for resource control, configuration control, and four task manipulation issues: scheduling, binding, mapping, and migration. The MP fires each application's initial task. When a running task needs to communicate with another task not yet mapped, the latter is loaded in a PE from task memory.

A directed graph, in which vertices represent software or hardware tasks and edges define communicating PE pairs, models each application. Each edge defines a master-slave communication. The edge source vertex is the master of the communication, and the other is the slave. Each edge receives four parameters, which represent the volume and rate of data sent from master to slave and from slave to master. Similar models are common in several mapping algorithms.5,6,10 However, unlike our algorithm, these
`
`
`
`
algorithms consider only the communication volume for the application characterization graph (ACG),5 ignoring transmission rate information.

Intertask communications use messages transmitted through the network. Four message types comprise the adopted communication protocol. A request message from a running task to the MP contains the identification of a new task to be inserted in the system. A release message notifies the MP that a processing PE has finished its current task, freeing this PE for reuse by another task. After task mapping, the MP sends a notify message to both tasks involved in the communication that fired the mapping. This message contains the tasks' current addresses, enabling correct packet transmission. Intertask communications employ general messages, which carry data some pairs of tasks need to exchange.
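As an illustration only, the four message types can be modeled as follows. All type, field, and payload names here are our assumptions for the sketch, not the authors' implementation:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class MsgType(Enum):
    REQUEST = auto()  # running task -> MP: asks for a new task to be mapped
    RELEASE = auto()  # PE -> MP: current task finished, PE is free again
    NOTIFY = auto()   # MP -> both tasks: communicates the mapped addresses
    GENERAL = auto()  # task <-> task: ordinary data exchange

@dataclass
class Message:
    kind: MsgType
    src: int                               # network address of the sender PE
    dst: int                               # network address of the receiver PE
    payload: dict = field(default_factory=dict)

# A request carries the id of the task to insert; a notify carries the
# current addresses of both communicating tasks.
req = Message(MsgType.REQUEST, src=5, dst=0, payload={"task_id": 7})
notify = Message(MsgType.NOTIFY, src=0, dst=5,
                 payload={"master_addr": 5, "slave_addr": 9})
```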
`
Dynamic mapping algorithms
The task-mapping problem is directly reducible to the quadratic assignment problem, a well-known NP-hard problem. Using heuristics to solve this problem at runtime is thus mandatory for even moderate-sized NoCs. The main cost function of the four dynamic mapping heuristics we propose here is the reduction of NoC channel congestion.
Each application has an initial task, which is the only one initially mapped when the application starts. Remaining application tasks are mapped on demand, according to communication requests. The mapping of initial application tasks uses a clustering strategy that partitions the MPSoC into virtual regions called clusters, each initially assigned to a single application. Only one initial task can be mapped in each cluster, thus reducing the probability of tasks that belong to different applications sharing the same NoC region. This also reduces the sharing of NoC channels by communications of distinct applications, in contrast to several other mapping strategies.1,5,11 The clustering step is executed at system initialization, when specific PEs are statically defined to map initial tasks for a set of applications. If the number of applications exceeds the available clusters, some applications are scheduled to be mapped when resources become available.

Mapping all tasks at once might provide better task-mapping solutions by making it possible to consider global information about resource utilization and the entire application graph structure. However, in a dynamic workload scenario, it is not possible to define when each task will be needed.
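A minimal sketch of this clustering step, assuming square clusters on an 8 × 8 mesh; the cluster size and the reserved-PE position inside each cluster are our illustrative choices, not values from the article:

```python
MESH = 8      # 8x8 NoC; PE addresses are (x, y) tuples
CLUSTER = 4   # each virtual region is a 4x4 block -> 4 clusters in total

def cluster_of(x, y):
    """Return the index of the cluster that contains PE (x, y)."""
    return (x // CLUSTER) + (y // CLUSTER) * (MESH // CLUSTER)

def initial_task_pes():
    """One statically reserved PE per cluster (here, near the block
    center), used to map each application's initial task."""
    pes = []
    for cy in range(MESH // CLUSTER):
        for cx in range(MESH // CLUSTER):
            pes.append((cx * CLUSTER + CLUSTER // 2,
                        cy * CLUSTER + CLUSTER // 2))
    return pes

# Each application's initial task goes to a distinct cluster, so tasks
# of different applications tend not to share the same NoC region.
```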
`
We use two reference mapping methods: first free (FF) and nearest neighbor (NN). FF starts at resource R00, walking the network column by column, from the bottom up. FF selects the first free resource according to binding definitions, without considering other metrics. FF might generate the worst results compared with the other heuristics presented here; however, we would expect even worse solutions if we used random mappings as a reference.10

Similar to FF, the NN method does not consider any metric except the proximity of an available resource that can execute the required task. NN starts searching for a free node to execute the task near the requesting task. The search tests all n-hop neighbors, n varying spirally between 1 and the NoC limits, stopping when it finds the first resource that can execute the task.
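The NN search order can be sketched as below, reading the spiral search as a ring-by-ring enumeration of PEs by Manhattan distance; `is_feasible` stands in for the occupation and binding tests, and all names here are illustrative:

```python
MESH = 8  # 8x8 mesh, coordinates 0..7

def ring(x, y, n):
    """All valid PE coordinates at exactly Manhattan distance n from (x, y)."""
    pes = []
    for dx in range(-n, n + 1):
        dy = n - abs(dx)
        for sy in {dy, -dy}:          # both vertical offsets (one when dy == 0)
            px, py = x + dx, y + sy
            if 0 <= px < MESH and 0 <= py < MESH:
                pes.append((px, py))
    return pes

def nearest_neighbor(x, y, is_feasible):
    """First feasible PE found while expanding rings outward from (x, y)."""
    for n in range(1, 2 * MESH):      # the NoC diameter bounds the search
        for pe in ring(x, y, n):
            if is_feasible(pe):
                return pe
    return None                       # no free, compatible PE exists
```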
The first heuristic proposed here is minimum maximum channel load (MMCL) mapping. MMCL evaluates all possible mappings for each new task inserted in the system. Its goal is to globally minimize channel usage peaks, reducing the occurrence of hot spots.

The second heuristic, minimum average channel load (MACL), aims at reducing the average usage of the NoC channels. MACL is similar to MMCL, replacing the maximum with the average function. While the MMCL heuristic tries to minimize channel usage peaks, MACL tries to homogeneously distribute the communication load in the NoC. The selected mapping is the one resulting in the lowest average channel usage.

The MMCL and MACL heuristics consider all NoC channels while mapping a new task. Since this evaluation can take a long time, the third heuristic, the path load (PL) algorithm, considers only the path segments used by the task under mapping. A path segment groups a set of network channels, defined as a function of the task placement and the routing algorithm. Given two tasks, A and B, a communication path between these tasks is represented by CP(A, B). This path is composed of four segments individually taken from east, west, south, and north channels. A path segment can be empty, for example, if the master-slave pair maps to the same column or row of PEs. PL computes the cost of each mapping k according to its respective path segments. For this purpose, PL adds the rates of the new task to the current rates in each channel in CP. The selected mapping is the one with minimum cost.
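The PL cost of one candidate placement can be sketched as follows, assuming the XY routing used in the authors' evaluation platform; the channel-load bookkeeping and the two rate parameters (master-to-slave and slave-to-master) are illustrative assumptions:

```python
def xy_path(src, dst):
    """Unidirectional channel list (as hop pairs) from src to dst under
    XY routing: travel along X first, then along Y."""
    (x, y), (tx, ty) = src, dst
    path = []
    while x != tx:
        nxt = x + (1 if tx > x else -1)
        path.append(((x, y), (nxt, y)))
        x = nxt
    while y != ty:
        nxt = y + (1 if ty > y else -1)
        path.append(((x, y), (x, nxt)))
        y = nxt
    return path

def pl_cost(master, candidate, load, rate_ms, rate_sm):
    """Cost of mapping the new task at `candidate`: current channel loads
    along CP(master, candidate), both ways, plus the new task's rates."""
    cost = 0.0
    for ch in xy_path(master, candidate):     # master -> slave segments
        cost += load.get(ch, 0.0) + rate_ms
    for ch in xy_path(candidate, master):     # slave -> master segments
        cost += load.get(ch, 0.0) + rate_sm
    return cost

# The selected mapping is the candidate with the minimum pl_cost().
```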
`
`
`
Inputs: A task ti to be mapped, the master task mi, the PE occupation matrix, and the NoC channel occupation matrices
Output: The best PE, best_pe

1.  dist ← 1                            // Initializes # of hops from mi (smallest)
2.  cost ← ∞                            // Initializes cost with highest value
3.  WHILE cost = ∞ DO
4.      // Get all mi neighbors within a distance dist
5.      pe_list ← neighbors(dist, mi)
6.      FOR ALL ELEMENTS pi IN pe_list
7.          // Tests PE occupation and type (i.e., task binding)
8.          IF state(pi) = free AND type(pi) = type(ti) THEN
9.              // Computes path load cost in NoC channels
10.             cost_pl ← 0
11.             channel_list ← routing_path(pi, mi)
12.             FOR ALL ci IN channel_list
13.                 cost_pl ← cost_pl + load(ci)
14.             END FOR
15.             // Set the minimum cost and the target PE
16.             IF cost_pl < cost THEN
17.                 cost ← cost_pl
18.                 best_pe ← pi
19.             END IF
20.         END IF
21.     END FOR
22.     dist ← dist + 1                 // Go to next range of neighbors
23. END WHILE
24. return best_pe

Figure 2. Best neighbor (BN) mapping algorithm pseudocode.
`
PL evaluates all mapping possibilities. The fourth heuristic, best neighbor (BN), combines the NN search strategy and the PL computation approach. The BN search method is similar to NN in its spiral searches from the source node. This avoids computing all feasible mapping solutions as in the PL heuristic, reducing the heuristic's execution time. BN selects the best neighbor, according to the PL equations, instead of the first free neighbor as in NN.

To illustrate the heuristics' behavior, Figure 2 depicts the BN algorithm. Before executing, BN or another mapping algorithm must verify the existence of free resources compatible with the task to be mapped. If such resources are not available, the task is scheduled for later execution.

The algorithm starts by testing the closest neighbors, those at dist = 1 hop (line 1). If no feasible mapping is found during this iteration, the distance is incremented, testing a new group of neighbors (line 22). For this purpose, in each WHILE loop iteration, the function neighbors(dist, mi) computes the list of all PEs placed at dist hops from task mi (line 5). Next, for all PEs in this list, it is necessary to test the resource occupation (the free or used state) and its compatibility (whether it is a hardware or a software task) with the task to be mapped (line 8). BN computes the cost only for feasible mappings.

The cost computation takes place at lines 11 to 14. To obtain the list of channels that comprise the communication path between mi and pi and back to mi, the algorithm invokes function routing_path(pi, mi) (line 11). This function returns a list of channels as dictated by the routing algorithm. The path cost (cost_pl) results from the accumulation of all channel loads in the list (lines 12 to 14). If the obtained cost is smaller than the previously computed cost, the cost is updated, and the same is true for the best PE (lines 16 to 19). Finally, when it finds at least one PE, the algorithm exits and returns the best PE.

The BN algorithm can produce either the PL or the NN algorithm by using selected simplifications. First, eliminating lines 10 to 19 of BN and substituting them with line 24 produces the NN algorithm, which returns the first neighbor compatible with the test in line 8. To produce the PL algorithm from BN, the starting point is to suppress the WHILE loop (lines 3 to 5, 22, and 23). In addition, the list pe_list must contain all the MPSoC's PEs.
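As a cross-check of the Figure 2 walkthrough, here is a runnable Python transcription of BN. The helper bodies (Manhattan-ring neighbors, XY routing, a dictionary of channel loads) are simplifying assumptions of ours, and a bound on dist is added so the function also terminates when no compatible PE exists, since the pseudocode assumes a prior feasibility check:

```python
import math

MESH = 8  # 8x8 mesh; PE addresses are (x, y) tuples

def neighbors(dist, pe):
    """All valid PEs at Manhattan distance `dist` from `pe` (line 5)."""
    x, y = pe
    out = []
    for dx in range(-dist, dist + 1):
        dy = dist - abs(dx)
        for sy in {dy, -dy}:
            p = (x + dx, y + sy)
            if 0 <= p[0] < MESH and 0 <= p[1] < MESH:
                out.append(p)
    return out

def routing_path(pe, master):
    """Channels of the round-trip path between pe and master (line 11),
    here under XY routing."""
    def one_way(src, dst):
        (x, y), (tx, ty) = src, dst
        chans = []
        while x != tx:
            nx = x + (1 if tx > x else -1)
            chans.append(((x, y), (nx, y)))
            x = nx
        while y != ty:
            ny = y + (1 if ty > y else -1)
            chans.append(((x, y), (x, ny)))
            y = ny
        return chans
    return one_way(pe, master) + one_way(master, pe)

def bn_map(task_type, master, pe_state, pe_type, load):
    """Return the best PE for the new task (Figure 2, lines 1-24)."""
    cost, best_pe = math.inf, None
    dist = 1
    while cost == math.inf and dist < 2 * MESH:   # added termination bound
        for pi in neighbors(dist, master):
            if pe_state[pi] == "free" and pe_type[pi] == task_type:
                cost_pl = sum(load.get(c, 0.0)
                              for c in routing_path(pi, master))
                if cost_pl < cost:
                    cost, best_pe = cost_pl, pi
        dist += 1
    return best_pe
```

For example, with every PE occupied except one at two hops from the master, the search skips the dist = 1 ring and returns the free PE found at dist = 2.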
`
`
Dynamic mapping evaluation
To evaluate the proposed dynamic mapping heuristics, we used a simulation environment that employs VHDL RTL code for the NoC description.12 The NoC is a 2D, wormhole, packet-switched mesh architecture that uses XY routing. PEs are described in terms of SystemC RTL threads. To hold system usage status, one thread describes the MP, containing scheduling queues, channel occupation matrices, and the PE occupation matrix. A generic and parameterizable thread describes each PE. When a given task is mapped to a given PE, this thread reads a configuration file that describes the task's communication rates and volumes, as well as its execution time. With parameterization, PEs can emulate several tasks during system simulation. This strategy allows the addition of new applications to the system, as already proposed by Chou, Ogras, and Marculescu.10

For our first set of experiments, we used three simulation scenarios:

• Scenario A (pipe): 20 identical pipeline-like applications (such as typical dataflow applications), each with 10 tasks (seven software and three hardware) and injection rates varying from 5% to 30% of available channel bandwidth.
• Scenario B (tree): 20 identical tree-like applications (such as typical parallel benchmarks), each with 10 tasks (eight software and two hardware) and injection rates varying from 5% to 20%.
• Scenario C (generic): 20 different generic applications produced with the TGFF (Task Graphs for Free) tool, each application consisting of 5 to 10 tasks, with varying numbers of hardware and software tasks and injection rates randomly chosen from 5% to 30%.

For these three scenarios, we adopted an 8 × 8 heterogeneous MPSoC. One node (the router with its PE) was the MP, 16 nodes were hardware resources, and 47 nodes were software resources. To uniformly spread resources over the system area, a set of PEs is reserved for initial tasks. The set of experiments arbitrarily assumed a maximum of 15 simultaneously running applications.

Figure 3 shows the various heuristics' average channel load (avg.), which represents NoC usage. All algorithms reduced average channel load in comparison with the FF method. MMCL and MACL reduced channel load by an average of 14% (with respect to FF), a result inferior to the 29.95% achieved with the NN algorithm, which did not consider traffic during mapping but explored the proximity of communicating tasks. The BN heuristic had a gain similar to NN (29.74%), and PL displayed the best gain (31.36%).

Figure 3. Average channel load (% of available bandwidth) for scenarios A, B, and C at varying injection rates.

Channel load standard deviation (SD) measures traffic distribution inside the network. Lower values correspond to a homogeneous traffic distribution, while higher values suggest that some channels have higher loads and others are not used at all. For the simulated scenarios, BN reduced channel load SD by 20%, with values similar to the NN algorithm. Again, PL presented the best results, with a gain of 22% over the FF reference mapping.

Table 1 shows the average packet latency of the proposed heuristics. Average packet latency depends on the distance between source and target PEs and the communication path's congestion. The last line of the table shows each algorithm's percentage gain over FF mapping. In most cases, we obtained the best results with the PL heuristic. Additionally, PL, BN, and NN presented quite similar packet latency values, resulting in similar gains (about 15% with respect to FF).
`
`
`
`
Table 1. Average packet latency (in clock cycles). FF and NN are reference mappings; MMCL, MACL, PL, and BN are the proposed heuristics. The lowest latency in each row is marked with an asterisk.

Scenario (injection rate)     FF      NN     MMCL    MACL     PL      BN
A pipe (5%)                  164     141*    152     156     142     147
A pipe (10%)                 269     239     255     258     239     238*
A pipe (15%)                 371     334     350     356     328*    330
A pipe (20%)                 527     431     479     478     425*    450
A pipe (25%)                 656     542     645     579     531*    531*
A pipe (30%)                 815     665     781     711     673     664*
B tree (5%)                  156     143*    147     153     146     146
B tree (10%)                 271     247     255     246     233*    242
B tree (15%)                 370     327*    349     352     329     330
B tree (20%)                 514     433     447     493     430*    434
C generic                    343     286     319     313     285*    291
Gain with respect to FF (%)    -   14.98    6.15    8.07   15.59   14.66
`
Table 2. Summary of results for all mapping algorithms, normalized to first free (FF) results. FF and NN are reference mappings; MMCL, MACL, PL, and BN are the proposed heuristics. The best value in each row is marked with an asterisk.

Performance parameter                 FF       NN      MMCL    MACL     PL       BN
Mapping complexity (x is NoC width)  O(x²)*   O(x²)*   O(x⁴)   O(x⁴)   O(x³)    O(x²)*
Channel load (avg.)                  1.00     0.70     0.86    0.85    0.69*    0.70
Channel load (SD)                    1.00     0.80     0.88    0.90    0.78*    0.80
Packet latency (avg.)                1.00     0.85     0.94    0.92    0.84*    0.85
Packet latency (SD)                  1.00     0.66*    1.33    0.89    0.98     1.19
Execution time (vol.)                1.00*    1.00*    1.25    1.14    1.09     1.03
Execution time (10× vol.)            1.00     0.98*    1.00    1.00    0.99     0.99
Packet latency standard deviation is an important metric for systems with quality-of-service constraints. Packets that originate from a given source must keep a regular interval between them to respect the injection rate. Latency variation incurs jitter, which can lead to packet loss in applications with strict temporal deadlines, such as real-time applications. In scenario C, all proposed heuristics improved latency SD (that is, reduced deviation). However, MMCL and MACL presented 10% and 16% gains respectively, while BN presented a 14% gain over FF. NN and PL reduced packet latency SD by around 21%.

The last aspect we evaluated is execution time overhead. This parameter's goal is to measure the impact of implementing the mapping heuristics on overall execution time. The complexity of MMCL and MACL imposed a large penalty (25% and 13% respectively). PL and BN presented execution time overheads of 8.8% and 2.5% respectively. We considered this an acceptable cost, since these heuristics reduced channel load. Varying parameters in the performed experiments showed that a tenfold increase in communication volume can cancel this overhead. In that case, BN executed in 0.8% less time than FF, and PL in 1.13% less time than FF.

Table 2 summarizes the experimental results, normalized to the FF reference mapping. The PL heuristic achieved the best results in most performance figures: 31% smaller channel usage and 22% smaller packet latency. NN executed faster, due to its smaller
`
`
`
`
algorithmic complexity. Note that with increased transmission volume, mapping execution time did not affect total execution time.

Dynamic versus static mapping
Our second set of experiments compared both static and dynamic mapping alternatives, and we have enumerated the advantages and drawbacks of each. The experiments investigated two static mapping algorithms originally proposed by Marcon et al.,6 simulated annealing (SA) and tabu search (TS), and the two dynamic algorithms that presented the best results in the experiments we described previously, PL and BN.

The set of graphs used in these experiments consisted of four synthetic application graphs (obtained with the TGFF tool), containing from seven to nine tasks, and the following four real application graphs:

• MPEG-4 decoder: 13 tasks;
• video object plane decoder (VOPD): 13 tasks;
• Romberg integration method (RBERG): 10 tasks;
• multiwindow display (MWD): 12 tasks.

A relevant feature of the MPEG-4 graph is that one of the 13 tasks is connected to eight other tasks. The VOPD application has less intertask dependency than the MPEG-4. The Romberg graph has significant intertask dependency, with most tasks communicating with four other tasks.

We evaluated two simulation scenarios:

• Scenario D (single-application mapping). This scenario employed a 5 × 4 homogeneous MPSoC. The evaluation goal was to compare mapping algorithms according to different performance figures for a single application. Thus, a static mapping could explore all MPSoC resources without the interference of other applications.
• Scenario E (multiple-application mapping). This scenario employed a 9 × 9 homogeneous MPSoC. The goal was to compare mapping algorithms for a set of simultaneously mapped applications.

Figure 4 presents the mapping of the real application graphs in scenario D for each algorithm. The figure shows the number of hops required to execute all bidirectional communication between all pairs of connected tasks for all mappings. The SA algorithm obtained the smallest number of hops for all applications, which meant fewer channels were used to execute all communications. BN and PL, which had a local view of the application, achieved results near SA's results and superior to TS's. The exception was the MPEG-4 application, in which task 1 had 8 neighbors. For BN and PL, task 1 was mapped near task 0 (its caller), and the remaining tasks connected to task 1 were mapped far from it because of a lack of available resources. Algorithms considering the whole application graph (SA and TS) mapped MPEG-4 task 1 in a more adequate position.

Figure 4. Mapping results for scenario D. Each square represents one PE. The number inside each square identifies the task allocated in that PE. The resulting number of hops per application and algorithm:

                            MPEG-4    MWD    RBERG    VOPD
Tabu search                   44       34      52       46
Simulated annealing           40       26      36       38
Path load/best neighbor       50       28      42       40

Table 3 presents the results obtained in scenario D, normalized to the SA algorithm. The channel
`
`
`
`
Table 3. Summary results for scenario D. The values have been normalized with regard to the SA algorithm results. PL and BN are dynamic; SA and TS are static.

Performance parameter          PL      BN      SA      TS
Channel occupation (avg.)     1.07    1.06    1.00    0.96
Channel occupation (SD)       0.97    0.92    1.00    0.94
Packet latency (avg.)         1.01    1.01    1.00    1.01
Packet latency (SD)           0.99    0.99    1.00    1.00
Number of hops                1.14    1.14    1.00    1.26
Total execution time          1.04    1.03    1.00    1.01
Energy consumption            1.38    1.38    1.00    1.11
`
occupation performance parameters (average and SD) represent NoC usage. The cost overhead of the dynamic algorithms was low, approximately 6% for the average NoC occupation. The cost of the average packet transmission latency was even lower, 1%. The total number of hops increased 14% for the dynamic approaches.

Concerning total execution time, Table 3 shows that the values obtained with the BN and PL algorithms were on average only 4% worse than SA's result. Note that the time for mapping each task is considered when dynamic algorithms execute. Because static mapping algorithms execute at design time, there was no mapping overhead during application execution.

The last line of Table 3 presents average energy consumption, estimated according to Marcon et al.6 Dynamic mapping algorithms were penalized in this performance parameter by the additional network hops involved in the communication. MPEG-4 was the application that most penalized dynamic mapping algorithms. If this application were not included in the average, the energy consumption penalty would be smaller (17%). This comparison shows that dynamic mapping has smaller costs than static mapping when application graphs don't have strongly connected tasks. Situations like that of the MPEG-4 benchmark are not commonplace in actual applications.

Mapping quality can be estimated as a function of application area fragmentation. Figure 5 shows the application area distribution for the BN mapping. BN and PL generated more-isolated applications. The smaller fragmentation observed in dynamic mapping came from the clustering strategy, which increased the size of contiguous blocks.

Table 4 presents the results for scenario E, normalized as a function of the TS algorithm (for larger benchmarks, TS outperforms SA). The overall overhead imposed by dynamic mapping in such a complex scenario was small, as shown in columns PL and BN: 7% and 13% in average channel occupation, 7% and 10% in average packet latency, 4% and 3% in total execution time, and 18% and 13% in communication energy consumption.
`
Figure 5. Application distribution for the BN algorithm in scenario E (number of hops = 264; An = applications; MP = manager processor). Each distinct shade of gray corresponds to a distinct application. The figure highlights a single task of application A0 (fragmentation), a PE shared by A2 and A5 (reuse), application A7 mapped in a contiguous area, and free resources.
`
Table 4. Summary of results for scenario E, normalized with regard to the TS algorithm. PL and BN are dynamic; SA and TS are static.

Performance parameter          PL      BN      SA      TS
Channel occupation (avg.)     1.07    1.13    1.13    1.00
Channel occupation (SD)       1.18    1.27    1.36    1.00
Packet latency (avg.)         1.07    1.10    1.16    1.00
Packet latency (SD)           1.03    1.11    1.05    1.00
Number of hops                0.93    0.92    1.59    1.00
Total execution time          1.04    1.03    1.02    1.00
Energy consumption            1.18    1.13    1.35    1.00
MPSOCS CONTAINING several tens of processors have recently been launched as commercial products. To function effectively, these complex MPSoCs must incorporate distributed or centralized PE management. Such management includes, for example, frequency and voltage control, processor and network load monitoring, and task mapping. The research described here contributes to the last item, task mapping.
`
`
According to our first set of experiments, the PL heuristic is the best solution. Compared to ad hoc mapping, the proposed algorithms allow up to 31% smaller channel load and up to 22% smaller packet latency. Additionally, the heuristics present an acceptable overhead cost. Local decisions taken by the algorithms make MMCL and MACL ineffective. When a new mapping does not reduce the average link load (or maximum load), these algorithms select the first available mapping option. NN and BN display some improvements, but with gains inferior to PL's.

Our second set of experiments shows that the cost overhead of dynamic mapping compared to static mapping for a complex scenario was on average 10% in channel occupation, 8.5% in latency, 3.5% in total execution time, and 15.5% in communication energy consumption. This is an acceptable overhead, considering the advantages offered by dynamic mapping:

• Smaller systems can be used, since only tasks being executed need be mapped into the system.
• The number of tasks can be greater than the available system resources.
• The inclusion of new applications after design extends the MPSoC's usability.

Dynamic mapping's main weakness is that it takes a partial view of the application graph, considering only the intercommunication of the task being mapped and its caller task. On the other hand, static mapping considers all tasks and resources together, using algorithms more expensive in terms of execution time. In future research, we plan to take several directions to improve the performance of dynamic mapping. For example, when a given task starts its execution, the MP could start the mapping of all its slave tasks or reserve resources for them. A new heuristic could adopt a task migration strategy when communication cost becomes too high for a given task. Finally