In this section, the steps of the proposed design flow are described. Since it has been assumed that the tasksets of the considered applications are known in advance, the majority of the required computations can be performed statically. Consequently, the mapping problem can be split into two stages: off-line (static) and on-line (dynamic), as shown in Fig. 4. The computation time of the off-line part is not crucial, and thus even high-complexity heuristics, such as genetic algorithms, may be used for runnable and label mappings.

During the application run-time, the current mode is assumed to be detected by observing a certain variable. When the value of this variable changes, the current runnable and label mapping may need to be switched. The mappings have been identified at design-time during the static mapping for both the initial and non-initial modes, while trying to minimise the amount of data to be migrated. Schedulability analysis guarantees that even the worst-case switching time does not violate the deadline required for mode changes. If such a violation is unavoidable, either the states should be clustered or the network bandwidth has to be increased.

In the remaining part of this section, we first explain the steps performed off-line, followed by a description of the dynamic stage.

### 4.1 Mode detection/clustering

During the analysis of a modal system, it may happen that runnables executed in neighbouring modes (i.e. modes connected in the FSM) have similar WCETs and resource utilisations. In such a case, there is little benefit in determining two separate mappings for these modes and migrating the runnables' contexts during transitions between them. It is more reasonable to cluster these states and treat them as a single mode in the further steps of the proposed approach.

Similarly, some transitions between modes may have to be performed immediately, whereas others are less time-critical. If the runnable contexts' migration has to be performed quickly, for example between two consecutive runnable occurrences (jobs), the bandwidth needed to transfer the appropriate amount of data in that time may be unreasonably wide. In such a situation, clustering of these modes shall also be considered.

The proposed approach is agnostic with respect to the chosen clustering method. In our implementation, the popular *k*-means algorithm has been applied, first introduced in [26]. The features used for clustering are the WCETs (or the numbers of operations to be executed in the worst-case scenario) of the particular runnables. Each mode is represented as a point in a *p*-dimensional vector space, referred to later as the feature space. In the *k*-means algorithm, the number of clusters, *k*, and *m* points in the *p*-dimensional feature space are provided as inputs, and the algorithm partitions all the points into *k* groups. An appropriate value of *k* is often evident from the knowledge about the relations between the *m* modes in the considered application. If this knowledge is limited, one of the numerous existing solutions can be used to determine the right value of *k*, e.g. [27].

Initially, the first *k* points are treated as single-element clusters, and each of the remaining *m*−*k* points is assigned to the cluster with the nearest centroid based on the Euclidean distance. Then, the centroid of each cluster is recomputed. These two stages, i.e. assigning points to the cluster with the nearest centroid and recomputing the centroids, are repeated until convergence. A set of *k* clusters is returned as output. The modes grouped into one cluster are merged into a single mode, in which the WCET of each runnable is equal to the maximal WCET of that runnable in any mode of the cluster.
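The clustering and merging steps above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names `kmeans` and `merge_modes` are ours, and the WCET points mirror the Fig. 5 example.

```python
def kmeans(points, k, iters=100):
    """k-means clustering: the first k points seed single-element clusters,
    then assignment and centroid recomputation alternate until convergence."""
    centroids = [list(p) for p in points[:k]]
    assign = None
    for _ in range(iters):
        # Assign each point to the cluster with the nearest centroid
        # (squared Euclidean distance gives the same nearest centroid).
        new = [min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centroids[c])))
               for pt in points]
        if new == assign:      # no point changed cluster: converged
            break
        assign = new
        # Recompute each centroid as the mean of its member points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def merge_modes(points, assign, k):
    """Merged mode WCET = per-runnable maximum over the modes of a cluster."""
    merged = []
    for c in range(k):
        members = [pt for pt, a in zip(points, assign) if a == c]
        if members:
            merged.append([max(col) for col in zip(*members)])
    return merged

# Five modes of a 2-runnable application (WCETs in ms), as in Fig. 5.
modes = [[1, 5], [2, 6], [5, 1], [6, 2], [9, 9]]
assign = kmeans(modes, k=3)
print(merge_modes(modes, assign, 3))  # [[2, 6], [9, 9], [6, 2]]
```

With these points, the modes [1,5] and [2,6] fall into one cluster and merge into a mode with WCETs [2,6], matching the worked example below.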

This concept may be illustrated with the following simple example. Let us have an application with *p*=2 runnables *Γ*= [*τ*_{1},*τ*_{2}] in *m*=5 distinctive modes. As there are only two runnables, the *m* points in the feature space have two dimensions, one corresponding to the WCET for *τ*_{1} and the second to the WCET for *τ*_{2}, and thus they can be shown on a plane. These feature points are presented as five circles in Fig. 5, where the OX and OY coordinates represent the WCETs (in ms) for *τ*_{1} and *τ*_{2}, respectively. The number of clusters has been set to *k*=3 and thus three centroids have been found, shown in the figure with crosses. The lines in the figure divide the plane into segments that are closer to a certain centroid than to the remaining ones. The modes described with the feature points belonging to the same segment are merged into a single mode. For example, in the uppermost segment two feature points can be found: [1,5] and [2,6]. After merging the corresponding modes, the WCET for the clustered mode equals *max*(1,2)=2 ms for *τ*_{1} and *max*(5,6)=6 ms for *τ*_{2}.

### 4.2 Spanning tree construction

In the proposed approach, the FSM describing the modes is traversed starting from the initial mode, and the runnable migration corresponding to each traversed transition is analysed. In this traversal, each mode should be analysed exactly once, as only one mapping is assigned to each mode. If a particular mode can be reached from a number of different modes, the most probable transition shall be chosen. Hence, the FSM describing mode changes should include weights denoting state transition probabilities. These probabilities can be given or determined during a simulation of the modal system. Then, the FSM is treated as a weighted directed connected graph *G*(*V*,*E*), where *V* is the set of vertices {*v*_{1},*v*_{2},…} and *E* denotes the set of directed edges. We firstly convert this graph into its undirected counterpart, *G*(*V*,*E*^{′}), where set *E*^{′} includes an edge (*v*_{k},*v*_{l}) if and only if (*v*_{k},*v*_{l})∈*E*∨(*v*_{l},*v*_{k})∈*E*. The weight of edge (*v*_{k},*v*_{l})∈*E*^{′}, *ω*(*v*_{k},*v*_{l}), is equal to the sum of the weights of edges (*v*_{k},*v*_{l})∈*E* and (*v*_{l},*v*_{k})∈*E*.

We use an algorithm for undirected graphs, as we take into consideration the probability of mode switching in both directions, i.e. the sum of these probabilities for the two directed edges connecting the given states in the related FSM (the weights in the undirected graph can thus not be treated as probabilities, as they may be higher than 1).
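A minimal sketch of this conversion is shown below. The directed transition probabilities are hypothetical values, chosen only so that their sums reproduce the undirected weights of the Fig. 6 example (*ω*(*A*,*B*)=0.9, *ω*(*A*,*C*)=0.7, *ω*(*B*,*C*)=1.4).

```python
def to_undirected(directed_edges):
    """Fold a directed weighted edge set into an undirected one:
    opposite edges (u,v) and (v,u) are merged, and their weights summed."""
    undirected = {}
    for (u, v), w in directed_edges.items():
        key = tuple(sorted((u, v)))   # one canonical key per vertex pair
        undirected[key] = undirected.get(key, 0.0) + w
    return undirected

# Hypothetical FSM transition probabilities (directed edges); note that
# edge (A, C) has no reverse counterpart, which the FSM formalism allows.
fsm = {("A", "B"): 0.5, ("B", "A"): 0.4,
       ("A", "C"): 0.7,
       ("B", "C"): 0.9, ("C", "B"): 0.5}
print(to_undirected(fsm))
```

The resulting undirected weights (0.9, 0.7, 1.4) may exceed 1 in general, which is why they can no longer be read as probabilities.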

To guarantee a single analysis of each mode while following the most probable paths, a maximum spanning tree can be constructed. We recollect that a spanning tree of a graph *G*(*V*,*E*^{′}) is its subgraph *T*(*V*,*E*^{′′}) which is connected and whose number of edges is equal to the number of vertices minus 1, |*E*^{′′}|=|*V*|−1. If \(\mathcal {T}\) denotes the set of all spanning trees of *G*, then \(T_{max}\left (V,E^{\prime \prime }_{max}\right)\) is a maximum spanning tree of *G* if and only if:

$$\mathop{{\forall}}_{T (V,E^{\prime\prime}) \in \mathcal{T}} \sum_{(v,z) \in E^{\prime\prime}_{max}} \omega(v,z) \geq \sum_{(v,z) \in E^{\prime\prime}} \omega(v,z), $$

where *ω*(*v*,*z*) is the weight value assigned to the edge from vertex *v* to *z*. A maximum spanning tree can be constructed in *O*(|*E*^{′}| log |*V*|) time using the classic Prim–Jarník algorithm [28].

According to this greedy algorithm, the tree is initialised with an arbitrary vertex; in our implementation, we select the vertex corresponding to the initial state of the FSM. Then, in each step, the vertex not yet in the tree that is connected to a tree vertex by the edge with the largest weight is added to the tree. This operation is repeated until all vertices have been added.
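The greedy growth can be sketched as follows. Note that this illustrative version scans all edges in every step, so it runs in *O*(|*V*|·|*E*^{′}|); the *O*(|*E*^{′}| log |*V*|) bound stated above requires a priority-queue implementation.

```python
def max_spanning_tree(weights, root):
    """Prim-Jarnik variant for a MAXIMUM spanning tree: starting from `root`,
    repeatedly add the heaviest edge crossing the cut between tree vertices
    and non-tree vertices. `weights` maps undirected edges (u, v) to weights."""
    vertices = {v for edge in weights for v in edge}
    in_tree, tree_edges = {root}, []
    while in_tree != vertices:
        # Heaviest edge with exactly one endpoint inside the tree.
        (u, v), w = max(((e, w) for e, w in weights.items()
                         if (e[0] in in_tree) != (e[1] in in_tree)),
                        key=lambda item: item[1])
        tree_edges.append((u, v))
        in_tree |= {u, v}
    return tree_edges

# Undirected weights from the Fig. 6 example, rooted at the initial state A.
w = {("A", "B"): 0.9, ("A", "C"): 0.7, ("B", "C"): 1.4}
print(max_spanning_tree(w, "A"))  # [('A', 'B'), ('B', 'C')]
```

The output reproduces the tree of Fig. 6e: edge (A,B) is taken first (0.9 > 0.7), then C is attached via (B,C) because 1.4 > 0.7.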

Let us illustrate this idea with a simple example, an FSM with three states, A, B and C, presented in Fig. 6a. The corresponding undirected weighted graph is shown in Fig. 6b. Vertex A is selected as the first vertex of the spanning tree (Fig. 6c). Two vertices are adjacent to the spanning tree, B and C, which is shown in the figure with dashed lines. In the next step, vertex B is added to the spanning tree, as it is connected with vertex A by the edge with the largest weight, *ω*(*A*,*B*)=0.9 (Fig. 6d). Vertex C is adjacent to the tree and can be connected to vertex A or B. In the third step, vertex C is connected to vertex B, as this edge has the larger weight, *ω*(*B*,*C*)=1.4>*ω*(*A*,*C*)=0.7. Since all the vertices have been added to the spanning tree, the Prim–Jarník algorithm terminates. The maximum spanning tree is presented in Fig. 6e.

Notice that the operation performed in this step neither influences the application behaviour nor limits the possible mode transitions. It only means that the least frequent transitions are not optimised during the further stages.

### 4.3 Static mapping

In the proposed approach, the algorithms for resource allocation in the initial and the remaining modes vary, and thus they are presented separately in the following two subsections.

#### 4.3.1 Initial mode

Algorithm 1 presents a pseudo-code of a genetic algorithm that can be used to identify a mapping for the initial mode. The algorithm ensures that no deadline violation occurs under the chosen allocation. We propose to use two fitness functions—measuring (i) the number of deadline violations and (ii) the total energy dissipated by the resources. The first fitness function value is of primary importance, as in a hard real-time system no deadline violation is allowed. However, among fully schedulable mappings, the one leading to a lower dissipated energy is chosen.

Each chromosome in the genetic algorithm contains genes of two types, as shown on the top of Fig. 7. The first *p* genes indicate the target cores for *p* runnables and the remaining |*Ψ*| genes (for a mesh NoC |*Ψ*|=*x*·*y*, where *x* and *y* are the mesh dimensions) specify the P-states of the consecutive cores.

In the algorithm, the following two main steps can be singled out.

Step 1. Initial population generation (line 1). An arbitrary number of random individuals (runnable mappings and P-states) is created.

Step 2. Creating a new population (lines 3–10). For each individual, the values of the two fitness functions, i.e. the number of deadline violations and the dissipated energy, are computed (lines 3–4). Individuals with the same number of deadline misses are grouped together (line 5). The groups are then sorted with respect to the number of deadline violations in ascending order (line 6). Inside each group, individuals are sorted by growing dissipated energy (line 7). The tournament selection is then performed, where individuals from a group with a lower number of deadline violations are always preferred, whereas among individuals from the same group the one with the lower dissipated energy is chosen (line 8). The computation of the tournament outcome is characterised by low overhead thanks to the ordering of the groups, and of the individuals in each group, performed earlier. The individuals winning the tournament are then combined using a typical crossover operation and mutated (line 9). Then, a new population is created from these individuals (line 10). Step 2 is repeated in a loop as long as a termination condition is not fulfilled, which can be a maximal number of generated populations or a lack of improvement over a number of subsequent generations.
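The grouping, sorting and tournament comparison of Step 2 can be sketched as below. This is an illustrative reconstruction, not the authors' implementation; the field names `misses` and `energy` are our assumptions.

```python
from itertools import groupby

def group_and_sort(population):
    """Lines 5-7: group individuals by deadline-miss count (ascending groups),
    with each group internally sorted by dissipated energy (ascending)."""
    ordered = sorted(population, key=lambda i: (i["misses"], i["energy"]))
    # groupby needs adjacent equal keys, which the sort above guarantees.
    return [list(g) for _, g in groupby(ordered, key=lambda i: i["misses"])]

def tournament_winner(a, b):
    """Line 8: fewer deadline misses always wins; energy breaks ties
    (plain lexicographic comparison of the two criteria)."""
    return a if (a["misses"], a["energy"]) <= (b["misses"], b["energy"]) else b

# Hypothetical population matching the i1..i4 example discussed below.
pop = [
    {"name": "i1", "misses": 2, "energy": 10.0},
    {"name": "i2", "misses": 0, "energy": 12.0},
    {"name": "i3", "misses": 2, "energy": 14.0},
    {"name": "i4", "misses": 0, "energy": 16.0},
]
groups = group_and_sort(pop)
print([[i["name"] for i in g] for g in groups])  # [['i2', 'i4'], ['i1', 'i3']]
print(tournament_winner(pop[3], pop[0])["name"])  # i4: schedulable beats 2 misses
```

Because both criteria collapse into one lexicographic key, the tournament outcome can be decided with a single tuple comparison, which is the low overhead mentioned above.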

For example, assume that a population is comprised of four individuals, *i*_{1}, *i*_{2}, *i*_{3} and *i*_{4}. The evaluation of these individuals made in line 3 shows that for individuals *i*_{1} and *i*_{3} as many as 2 deadlines are missed, whereas the mappings for *i*_{2} and *i*_{4} are schedulable considering the P-state assignments encoded in these individuals. According to the energy dissipation evaluated in line 4, *i*_{1} dissipates the lowest energy, followed by *i*_{2}, *i*_{3} and *i*_{4} (in this order).

Since *i*_{1} and *i*_{3} violate the same number of deadlines, they are joined together in one group, *Group*_{1}, according to line 5. Individuals *i*_{2} and *i*_{4} are grouped together in *Group*_{2} as they do not violate any deadline. In each group, the individuals are sorted with respect to the dissipated energy in the ascending order (line 7).

In the tournament selection performed in line 8, each individual from *Group*_{2} wins over any individual from *Group*_{1}, as the number of violated deadlines is the more important criterion. So, for example, *i*_{4} beats *i*_{1} despite dissipating more energy. If a tournament is performed for two individuals from the same group, i.e. violating the same number of deadlines, the individual characterised by a lower energy dissipation is the winner. For example, if both individuals from *Group*_{1} enter the tournament, *i*_{1} becomes the winner.

#### 4.3.2 Non-initial modes

As mentioned earlier, it is of primary importance to migrate as little data as possible during mode changes to minimise the migration time and energy. However, it may be beneficial to migrate more data if the energy dissipated in the next mode is much lower than the migration energy. Thus, there is a trade-off between the amount of migrated data (or migration time) and the energy dissipation in the next mode. It is the role of the designer to choose a proper solution from the Pareto frontier.

A mapping M is a vector of *p* core locations, \(\mathrm {M}=\,[\pi _{\tau _{1}},\ldots,\pi _{\tau _{p}}]\), where each element corresponds to the appropriate element of *Γ* (taskset) and can be substituted with any element of set *Π* (processing cores).

To optimise the migration cost while considering the context lengths of the transmitted runnables, weight vector W is introduced. Each element of this vector \(\mathrm {W}=\,[w_{\tau _{1}},\ldots, w_{\tau _{p}}]\) is equal to the amount of data that has to be transferred when a particular runnable (*τ*_{1},…,*τ*_{p}) is migrated, including the labels to be read or written.

Let \(\mathcal {M}_{\alpha }\) and \(\mathcal {M}_{\beta }\) be sets of mappings (i.e. sets of M vectors) that are fully schedulable in a given system in modes *α* and *β*, respectively. The elements of the difference vector \(\phantom {\dot {i}\!}\mathrm {D}_{\mathrm {M}_{\alpha },\mathrm {M}_{\beta }}=\,[d_{\tau _{1}}, \ldots, d_{\tau _{p}}]\) indicate which runnables are to be migrated when the mode is changed from *α* to *β*. Each element *d*_{δ}, *δ*∈{*τ*_{1},…,*τ*_{p}}, takes value 1 if the particular runnable is allocated to different cores in mappings \(\mathrm {M}_{\alpha } \in \mathcal {M}_{\alpha }\) and \(\mathrm {M}_{\beta } \in \mathcal {M}_{\beta }\), and 0 otherwise:

$$ d_{\delta}=\left\{ \begin{array}{ll} 1, & \text{if}\ \mathrm{M}_{\alpha,\delta} \neq \mathrm{M}_{\beta,\delta},\\ 0, &\text{otherwise}. \end{array} \right. $$

(1)

where M_{α,δ} and M_{β,δ} denote the *δ*-th element of vectors M_{α} and M_{β}, respectively. The migration cost *c* between two modes *α* and *β* is then computed in the following way:

$$ c_{\mathrm{M}_{\alpha},\mathrm{M}_{\beta}}=\mathrm{D}_{\mathrm{M}_{\alpha},\mathrm{M}_{\beta}} \cdot \mathrm{W}^{\mathrm{T}}. $$

(2)

For example, we consider a taskset with three runnables *Γ*= [*τ*_{1},*τ*_{2},*τ*_{3}]. The elements of vector W= [100,200,150] describe the context lengths (in bytes) of *τ*_{1}, *τ*_{2} and *τ*_{3}, respectively. Assume there is one mapping in mode *α*, \(\mathcal {M}_{\alpha } = \{\mathrm {M}_{\alpha 1}\}\), and two mappings in mode *β*, \(\mathcal {M}_{\beta } = \{\mathrm {M}_{\beta 1}, \mathrm {M}_{\beta 2}\}\), where M_{α1}= [*π*_{1},*π*_{1},*π*_{2}], M_{β1}= [*π*_{1},*π*_{2},*π*_{2}] and M_{β2}= [*π*_{2},*π*_{1},*π*_{1}]. Thus the corresponding difference vectors equal \(\phantom {\dot {i}\!}\mathrm {D}_{\mathrm {M}_{\alpha 1}, \mathrm {M}_{\beta 1}}=\,[0,1,0]\) and \(\phantom {\dot {i}\!}\mathrm {D}_{\mathrm {M}_{\alpha 1}, \mathrm {M}_{\beta 2}}=\,[1,0,1]\). The migration costs between these mappings are \(c_{\mathrm {M}_{\alpha 1}, \mathrm {M}_{\beta 1}}=\mathrm {D}_{\mathrm {M}_{\alpha 1}, \mathrm {M}_{\beta 1}} \cdot \mathrm {W}^{\mathrm {T}}=200\) and \(c_{\mathrm {M}_{\alpha 1}, \mathrm {M}_{\beta 2}}=\mathrm {D}_{\mathrm {M}_{\alpha 1}, \mathrm {M}_{\beta 2}} \cdot \mathrm {W}^{\mathrm {T}}=250\) bytes. If minimisation of the migrated data size is the only criterion, mapping M_{β1} shall be chosen for mode *β*.
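Equations (1) and (2) can be checked against this worked example with a few lines of code. The sketch below is ours; the strings `pi1`/`pi2` stand in for the core identifiers *π*_{1} and *π*_{2}.

```python
def migration_cost(m_a, m_b, w):
    """Eq. (2): cost = D . W^T, where D (Eq. (1)) marks the runnables
    whose core assignment differs between the two mappings."""
    d = [1 if a != b else 0 for a, b in zip(m_a, m_b)]
    return sum(di * wi for di, wi in zip(d, w))

# Context lengths (bytes) of tau1, tau2, tau3, and the example mappings.
W = [100, 200, 150]
M_a1 = ["pi1", "pi1", "pi2"]
M_b1 = ["pi1", "pi2", "pi2"]
M_b2 = ["pi2", "pi1", "pi1"]
print(migration_cost(M_a1, M_b1, W))  # 200 bytes: only tau2 changes core
print(migration_cost(M_a1, M_b2, W))  # 250 bytes: tau1 and tau3 change core
```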

A recursive greedy algorithm for reducing the amount of data transferred during mode changes is presented in Algorithm 2.

Since some cycles are likely to occur in a graph representing the finite state machine describing transitions between modes, a maximum spanning tree (ST) is to be built, as described earlier. Then the mode corresponding to the initial state of the FSM is selected as the current mode (line 1). For this mode, a set of schedulable mappings is generated, e.g. with Algorithm 1 (line 2). If more than one schedulable mapping is found, the one leading to the lowest energy dissipation is selected (line 3). Then for each direct successor of the ST vertex corresponding to the FSM initial state, the *FindMappingMin* procedure is executed (lines 4 and 5).

In the *FindMappingMin* procedure, a Pareto frontier of schedulable mappings for that successor vertex is found using two criteria: (i) minimal migration cost criterion represented by Eq. (2) and (ii) minimal energy dissipated in the next mode (line 1.1). The most suitable schedulable mapping is chosen from the Pareto frontier based on the design priorities (line 1.2). The *FindMappingMin* procedure is then recursively run for each direct successor of the ST vertex provided as the function parameter (lines 1.3 and 1.4).
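The recursive walk can be sketched as follows. This is a deliberately simplified, hypothetical reconstruction: the Pareto frontier and the energy criterion of line 1.1 are reduced to pure migration-cost minimisation, and the candidate mappings are made up for illustration.

```python
def cost(m_a, m_b, w):
    # Migrated-data cost: sum of context sizes of runnables changing core.
    return sum(wi for a, b, wi in zip(m_a, m_b, w) if a != b)

def find_mapping_min(tree, mode, chosen, candidates, w):
    """Sketch of FindMappingMin: for each child of `mode` in the spanning
    tree, pick the schedulable candidate mapping with the lowest migration
    cost from the parent's chosen mapping, then recurse into the child."""
    for child in tree.get(mode, []):
        chosen[child] = min(candidates[child],
                            key=lambda m: cost(chosen[mode], m, w))
        find_mapping_min(tree, child, chosen, candidates, w)
    return chosen

# Spanning tree A -> B -> C (as in Fig. 6e); hypothetical schedulable
# candidate mappings per mode and runnable context lengths in bytes.
tree = {"A": ["B"], "B": ["C"]}
W = [100, 200, 150]
candidates = {
    "B": [["pi1", "pi2", "pi2"], ["pi2", "pi1", "pi1"]],
    "C": [["pi1", "pi2", "pi1"], ["pi2", "pi2", "pi2"]],
}
chosen = find_mapping_min(tree, "A", {"A": ["pi1", "pi1", "pi2"]}, candidates, W)
print(chosen["B"], chosen["C"])
```

In the full algorithm, each child would instead receive the whole Pareto frontier over migration cost and next-mode energy, with the designer's priorities deciding among its members.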

More mappings could be delivered to the *FindMappingMin* procedure to browse a larger search space, by skipping lines 3 and 1.2 of the algorithm and providing all elements of \(\mathcal {M}_{\alpha }\) instead of just one. It is the role of the designer to set priorities between the migration time and energy dissipation to select the most suitable solution from the Pareto frontier.

If Algorithm 2 is applied to the example spanning tree presented in Fig. 6e, mode A is substituted for *α*, as it corresponds to the initial state of the FSM shown in Fig. 6a. Then the mapping M_{α} that is schedulable and dissipates the lowest amount of energy is determined using Algorithm 1. As there is only one direct successor of mode A in the spanning tree, the *FindMappingMin* procedure is executed for mode B. In this procedure, a Pareto frontier over schedulable mappings in mode B is determined, minimising the energy dissipation in B and the amount of data to be transferred during the mode switch from A to B. After selecting the appropriate mapping using the assumed design priorities, procedure *FindMappingMin* is executed again for mode C, the only successor of mode B in the spanning tree.

### 4.4 Schedulability analysis

The proposed runnable mapping technique aims to benefit from the modal nature of applications, but it also poses new challenges. If the modes are treated independently from each other, the end-to-end schedulability of runnables and packet transmissions in each mode can be analysed using equations from [29]. However, the instant of transition between the modes requires special attention, as additional migration-related traffic appears, including the whole contexts of the runnables and the labels to be migrated. To guarantee the taskset schedulability during migration, we propose to treat a migration process as any other asynchronous process in the typical schedulability analysis, i.e. to use so-called *periodic servers*, which are periodic tasks executing aperiodic jobs. When a periodic server is executed, it processes a pending runnable migration. If there is no pending migration, the server simply holds its capacity.

The dynamic (i.e. changeable) part of the context shall be migrated at once, using the last job of the periodic server. This means that the local memory locations that can be modified by the runnable must not be precopied, but migrated after the last execution of the runnable in the old location. This requirement can influence the minimum periodic server size (i.e. the time allocated to it by a scheduler in each hyperperiod, where the hyperperiod is the least common multiple of all runnables' periods) and, consequently, the network bandwidth, which must be wide enough to guarantee migration of the dynamic part before the next job of the runnable is executed in the new location.
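A back-of-the-envelope estimate of the migration length can illustrate this constraint. The sketch below is purely illustrative, with made-up numbers, and assumes one periodic server job per hyperperiod: the static context is precopied over successive server jobs, while the dynamic part must fit entirely into the single last job.

```python
import math

def migration_hyperperiods(static_bytes, dynamic_bytes, server_bytes_per_job):
    """Illustrative estimate of the number of hyperperiods a migration takes,
    assuming one server job per hyperperiod. The dynamic context must fit
    in one job, since it is transferred at once after the runnable's last
    execution at the old location."""
    assert dynamic_bytes <= server_bytes_per_job, \
        "server capacity too small for the dynamic part of the context"
    # Precopy the static part over ceil(static/capacity) jobs,
    # then one final job for the dynamic part.
    return math.ceil(static_bytes / server_bytes_per_job) + 1

# Hypothetical context of 2100 bytes: 1800 static + 300 dynamic,
# with a server able to transfer 500 bytes per hyperperiod.
print(migration_hyperperiods(1800, 300, 500))  # 5 hyperperiods
```

If the resulting number of hyperperiods exceeds the mode-change deadline, either the server size (and hence the bandwidth) must grow or the involved modes should be clustered, as discussed in Section 4.1.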

In the proposed approach, any kind of periodic server can be used. However, the trade-off between implementation complexity and the ability to guarantee the deadlines of hard real-time runnables, as described for example in [30], shall be considered. More details regarding the schedulability analysis scheme applied in the proposed approach are provided in [24].

### 4.5 On-line steps

In the proposed approach, three steps are performed on-line: *Detection of current mode*, *Mapping switching* and *Changing voltage/frequency levels of cores*.

In all the ECUs known to the authors of this paper, the system modes are defined explicitly, and it is possible to determine the current mode by observing some system model variables, similarly to [7]. (For example, in DemoCar such a variable is named *_sm* and is stored in runnable *OperatingModeSWCRunnableEntity*.)

When a mode change is requested, an agent residing in each core prepares a set of packages with the runnables to be migrated via the network. This agent is configured statically and is equipped with a table with information about the runnables that need to be migrated during a particular mode change. Then the contexts of these runnables are migrated. In the following hyperperiods, the runnable contexts are transferred using periodic servers of the length determined statically using schedulability analysis, as described earlier. The agent is aware of the number of periodic server jobs that have to be used during the whole migration process and has the dynamic (volatile) portion of the context identified. This part of the context is transmitted in a single job of the periodic server, just after the last execution of the runnable at its old location. After migrating the dynamic part of the runnable's context, the runnable is removed from the earlier (migration source) core.

Simultaneously, the same agent can receive migration data from other agents in the network. When the number of hyperperiods precomputed at design-time elapses, the contexts of these runnables are fully migrated and ready to be executed on the migration target core.

Before the first execution of a runnable in a new mode, the agent switches the P-state of the processing core to the value determined during the static analysis, described earlier in this paper.

The details of the agent depend on the underlying operating system. Regardless of its implementation, *Detection of current mode* shall be characterised by low computational complexity and thus impose low run-time overhead on the system. The number of hyperperiods required for performing runnable migration during *Mapping switching* depends on the size of the runnables and labels to be transferred, the mappings, and the network bandwidth, in particular the flit size and the timing constants for packet latencies while traversing one router and one link (*d*_{R} and *d*_{L}). This dependency will be explored in the following section.