Real-time semi-partitioned scheduling of fork-join tasks using work-stealing
 Cláudio Maia^{1},
 Patrick Meumeu Yomsi^{1},
 Luís Nogueira^{1} and
 Luis Miguel Pinho^{1}
https://doi.org/10.1186/s1363901700795
© The Author(s) 2017
Received: 10 February 2016
Accepted: 21 August 2017
Published: 13 September 2017
Abstract
This paper extends the work presented in Maia et al. (Semi-partitioned scheduling of fork-join tasks using work-stealing, 2015), where we address the semi-partitioned scheduling of real-time fork-join tasks on multicore platforms. The proposed approach consists of two phases: an offline phase, where we adopt a multiframe task model to perform the task-to-core mapping so as to improve the schedulability and the performance of the system, and an online phase, where we use the work-stealing algorithm to exploit tasks’ parallelism among cores with the aim of improving the system responsiveness. The objective of this work is twofold: (1) to provide an alternative scheduling technique that takes advantage of the semi-partitioned properties to accommodate fork-join tasks that cannot be scheduled in any pure partitioned environment and (2) to reduce the migration overheads, which have been shown to be a traditional major source of non-determinism for global scheduling approaches. In this paper, we consider different allocation heuristics and we evaluate the behavior of two of them when they are integrated within our approach. The simulation results show an improvement of up to 15% of the proposed heuristic over the state-of-the-art in terms of the average response time per task set.
Keywords
 Parallel tasks
 Semi-partitioned scheduling
 Work-stealing
 Multicore platforms
1 Introduction
Multicore platforms are now very common in the embedded systems domain as they provide more computing power for the execution of complex applications with stringent timing constraints. This boost in performance substantially increases the complexity of the scheduling problem of real-time tasks that execute upon these platforms. While a uniprocessor scheduling problem reduces to deciding when to schedule each task, a new dimension is added when shifting to multicores: it must also be decided where to execute each task. In order to solve this rather challenging issue, several scheduling algorithms have been proposed in the literature (see [1] for a comprehensive and up-to-date survey).
Another important feature of these platforms is that they make intra-task parallelism possible by taking advantage of the task structure. At compile time, intra-task parallelism can be extracted from application loops by using programming frameworks such as OpenMP [2]. These frameworks resort to dynamic scheduling strategies in order to schedule application tasks. One of the most common strategies in use is work-stealing [3]. In summary, work-stealing is a load-balancing algorithm which allows an idle core to randomly steal some workload from a busy core, referred to as the “victim”, with the objective of reducing the average response time of a task executing on a target platform^{1}. While randomness in the selection of a victim is traditionally acceptable in several computing domains, no guarantee can actually be provided regarding the timing behavior of the tasks as there is a possibility of priority inversion among them. A solution to circumvent this limitation consists of using multiple per-core priority double-ended queues (known as deques^{2}) [4].
In this paper, we consider fork-join real-time tasks (i.e., a special case of parallel real-time tasks) in a semi-partitioned scheduling context so that we can explore the potential parallelism of migrating tasks at runtime by resorting to the load-balancing property provided by a variant of the work-stealing algorithm [4]. The goal is to reduce the average response time of the tasks and create additional room in the schedule for less-critical tasks (e.g., aperiodic and best-effort tasks).
We recall that semi-partitioned scheduling [5–7] considers two steps: (step 1) a task-to-core mapping is performed at design time where a subset of tasks (the subset of non-migrating tasks) is assigned to specific cores and is not allowed to migrate at runtime; (step 2) if a task cannot be assigned to any of the cores without jeopardizing its schedulability, then this task is referred to as a migrating task and is scheduled by using a global scheduling approach to seek a valid schedule.
In the proposed approach, the behavior of each migrating task is further restricted. At runtime, each job activation of a migrating task follows a job-to-core execution pattern elaborated at design time in order to improve both the schedulability of the system and its utilization factor. In addition, we consider a task-level migration strategy, i.e., various jobs of a migrating task are allowed to be assigned to different cores, but once a job is assigned to a core, migrations of this job prior to its completion are forbidden. In contrast, job-level migration approaches allow each job assigned to a core to migrate to another core prior to its completion. By design, the proposed model limits the number of migrations, which have been recognized as one of the main sources of non-determinism on multicores, by limiting work-stealing to occur between cores that share a copy of a task^{3}.
Contributions The contribution of this work is fourfold: (1) We present a complete framework that supports the scheduling of fork-join real-time tasks onto a multicore platform together with the associated schedulability analysis. (2) As we assume that cores that share jobs of a migrating task have a local copy of this task, we reduce both the overhead concerning task fetching and the number of task migrations due to the offline job-to-core mapping. (3) As the parallel regions of each fork-join task can execute simultaneously on different cores, we take advantage of the work-stealing mechanism to reduce the average response time of the tasks without jeopardizing the schedulability of the whole system. To the best of our knowledge, we are the first to use work-stealing in the context of a semi-partitioned scheduling scheme. (4) We extend the work presented in [8] by comparing different allocation heuristics in terms of their allocation behavior. For two of these heuristics, we evaluate the improvement given by using work-stealing in terms of task average response times. Moreover, we explain how to integrate tasks with a density greater than one into our framework.
Paper organization The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the model of computation used throughout the paper. Section 4 details our proposed approach. Section 5 provides an example of how the framework works. Section 6 explains how decomposition-based models can be used to accommodate tasks with density greater than one. Section 7 presents the schedulability analysis of the proposed approach. Section 8 reports on simulation results from experiments on synthetic task sets. Finally, Section 9 concludes the paper.
2 Related work
Both decomposition-based techniques [9, 10, 12] and non-decomposition-based techniques [11, 13] have been proposed to analyze the schedulability of these three task models. Specifically, resource and capacity augmentation bounds can be used to evaluate the schedulability of all task models while response-time analysis [14, 15] can be used to analyze synchronous parallel tasks.
Unfortunately, very few techniques exist in the literature for the analysis of semi-partitioned scheduling of parallel tasks. Bado et al. [16] proposed a semi-partitioned approach with job-level migration for fork-join tasks, which is similar to the one in [9], but due to the assignment methods proposed in their paper for the offsets and local deadlines, they did not guarantee that subtasks actually execute in parallel. While their work is similar to ours w.r.t. the adopted class of schedulers (semi-partitioned), we differ in that we relax the constraint of restricting the task parallelism and we use task-level migration instead of job-level migration, thus further reducing the number of migrations at runtime.
3 System model
Task specifications We consider a set \(\tau \stackrel {\text {def}}{=} \{\tau _{1}, \ldots, \tau _{n}\}\) composed of n sporadic fork-join tasks. Each sporadic fork-join task \(\tau _{i} \stackrel {\text {def}}{=} \langle S_{i}, D_{i}, T_{i} \rangle \), 1≤i≤n, is characterized by a finite sequence of segments \(S_{i} \stackrel {\text {def}}{=} \left [s_{i}^{1}, s_{i}^{2}, \ldots, s_{i}^{n_{i}}\right ]\), with \(n_{i} \in \mathbb {N}\), a relative deadline D _{ i } and a period T _{ i }. These parameters are given with the following interpretation: at runtime, each task τ _{ i } generates a potentially infinite number of successive jobs τ _{ i,j }. Each job τ _{ i,j } has a finite sequence of segments S _{ i }, arrives at time a _{ i,j } such that a _{ i,j+1}−a _{ i,j }≥T _{ i }, and must be completed within [a _{ i,j },d _{ i,j }) where \(d_{i,j} \stackrel {\text {def}}{=} a_{i,j} + D_{i}\) is its absolute deadline. Each segment \(s_{i}^{k} \in S_{i}\) (with 1≤k≤n _{ i }) is composed of a set of independent subtasks ^{5} \(t_{s_{i}^{k}} \stackrel {\text {def}}{=} \left \{t_{s_{i}^{k}}^{1}, \ldots, t_{s_{i}^{k}}^{v_{k}}\right \}\), where v _{ k } denotes the number of subtasks belonging to segment \(s_{i}^{k}\), and the sequence S _{ i } represents dependencies between segments. That is, for all \(s_{i}^{\ell }, s_{i}^{r} \in S_{i}\) such that ℓ<r, the subtasks belonging to \(s_{i}^{r}\) cannot start executing unless those of \(s_{i}^{\ell }\) have completed. The execution requirement of subtask \(t^{q}_{s_{i}^{k}}\) (with 1≤q≤v _{ k }) is denoted by \(e^{q}_{s_{i}^{k}}\). The total execution requirement of task τ _{ i }, denoted by C _{ i }, is the sum of the execution requirements of all the subtasks in S _{ i }, i.e., \(C_{i} \stackrel {\text {def}}{=} \sum _{k=1}^{n_{i}} \sum _{q=1}^{v_{k}} e^{q}_{s_{i}^{k}}\). Every subtask is assumed to execute on at most one core at any time instant and can be interrupted prior to its completion by another subtask with a higher priority.
A preempted subtask is assumed to resume its execution on the same core as the one on which it was executing prior to preemption. We assume that each preemption is performed at no cost or penalty. The minimum execution requirement of task τ _{ i }, denoted as P _{ i }, is defined as the time that τ _{ i } takes to execute when it is assigned to an infinite number of cores^{6}, i.e., \(P_{i} = \sum _{k=1}^{n_{i}} c_{s_{i}^{k}}\), where \(c_{s_{i}^{k}}\) denotes the worst-case execution time among the subtasks of segment k. The utilization factor of τ _{ i } is \(U_{i} = \frac {C_{i}}{T_{i}}\) and its density is \(\lambda _{i} = \frac {C_{i}}{\min (D_{i}, T_{i})}\). The total utilization factor of τ is \(U_{\tau } \stackrel {\text {def}}{=} \sum _{i=1}^{n} U_{i}\) and its total density is \(\lambda _{\tau } \stackrel {\text {def}}{=} \sum _{i=1}^{n} \lambda _{i}\). For each task τ _{ i }, we assume D _{ i }≤T _{ i }, which is commonly referred to as the constrained-deadline task model. The task set τ is said to be \(\mathcal {A}\)-schedulable if algorithm \(\mathcal {A}\) can schedule τ such that all the jobs of every task τ _{ i }∈τ meet their deadline D _{ i }.
The left side of Fig. 1 illustrates a fork-join task τ _{ i } with n _{ i }=5 segments: three sequential segments (s ^{1}, s ^{3}, and s ^{5}) with one subtask each and two parallel segments, s ^{2} containing three subtasks and s ^{4} containing two subtasks. All the subtasks in the parallel segments are independent from each other and therefore can execute in parallel. On the upper right side of the figure, it is possible to observe the task structure framed according to the timing properties of the task (P,D,T), and on the bottom right side, it is possible to observe the task’s serialized representation (i.e., task execution without parallelism).
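As an illustration of the quantities defined above, the following minimal sketch computes C _{ i }, P _{ i }, U _{ i }, and λ _{ i } for a fork-join task given as a list of segments. The function name `task_metrics` and the list-of-lists representation are assumptions made for this example; the sample task mirrors τ _{1} of Section 5 (a sequential unit, two parallel subtasks of 0.5 time units, and a sequential unit, with D=5 and T=6).

```python
from fractions import Fraction

def task_metrics(segments, D, T):
    """segments[k] lists the execution requirements of the subtasks of segment k.

    Hypothetical helper: returns (C_i, P_i, U_i, lambda_i) as defined in the text.
    """
    C = sum(sum(seg) for seg in segments)   # total execution requirement C_i
    P = sum(max(seg) for seg in segments)   # minimum execution requirement P_i
    U = C / T                               # utilization factor U_i
    lam = C / min(D, T)                     # density lambda_i
    return C, P, U, lam

# Task shaped like tau_1 in Section 5; exact fractions avoid rounding noise.
tau1 = [[Fraction(1)], [Fraction(1, 2), Fraction(1, 2)], [Fraction(1)]]
C, P, U, lam = task_metrics(tau1, D=5, T=6)
print(C, P, U, lam)  # 3 5/2 1/2 3/5
```

With these values the task has density 0.6, so it would be classified as a heavy task under the λ _{ i }>0.5 threshold used later in the paper.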
Each migrating task is modeled as a multiframe task. The multiframe task model (as presented by Mok and Chen [17] and later generalized by Baruah et al. [18]) allows system designers to model a task by using a static and finite list of execution requirements, corresponding to successive jobs (or frames as they are named in this model). Specifically, by repeating this list (possibly ad infinitum), a periodic sequence of execution requirements is generated such that the execution time of each frame is bounded from above by the corresponding value in the periodic sequence.
Platform and scheduler specifications We consider a platform \(\pi \stackrel {\text {def}}{=} \{\pi _{1}, \pi _{2}, \ldots, \pi _{m}\}\) comprising m homogeneous cores, i.e., all the cores have the same computing capabilities and are interchangeable. Each core runs a fully preemptive Earliest Deadline First (EDF) scheduler. The EDF scheduling policy dictates that the smaller the absolute deadline of a job, the higher its priority. The schedulability of a task set scheduled by following the EDF scheduler upon a uniprocessor platform can be evaluated by using the Demand Bound Function (DBF) [19]. The DBF of task τ _{ i } at any time instant t≥0 is defined as \(\text {DBF}(\tau _{i},t)\stackrel {\text {def}}{=} \left (\left \lfloor \frac {t - D_{i}}{T_{i}} \right \rfloor + 1 \right) \cdot C_{i}\) and the DBF of task set τ is derived as \(\text {DBF}(\tau,t) \stackrel {\text {def}}{=} \sum _{\tau _{i} \in \tau } \text {DBF}(\tau _{i},t)\). The notations used throughout the paper are summarized in Appendix: Table 1.
We allow work-stealing only among the cores that execute a migrating task. Jobs of migrating tasks execute on selected cores according to an execution pattern that is determined offline. By allowing work-stealing only among these cores, a reduction of the average response time of each migrating task is possible, thus contributing to the improvement of the overall system responsiveness.
Our framework assumes a shared-memory model with properties (multithreaded, shared address space, etc.) similar to those of the parallel frameworks that integrate work-stealing (such as OpenMP).
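The DBF-based test can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes integer task parameters given as hypothetical (C, D, T) tuples, synchronous releases, and the condition DBF(τ,t)≤t checked at every absolute deadline up to one hyperperiod (which is enough when the total utilization does not exceed 1 and D _{ i }≤T _{ i }). The clamp `max(0, ...)` is an added safeguard and is not part of the formula above.

```python
from fractions import Fraction
from math import lcm

def dbf(C, D, T, t):
    """Demand bound function of one sporadic task at time t (clamped at 0)."""
    return max(0, (t - D) // T + 1) * C

def edf_schedulable(tasks):
    """tasks: list of (C, D, T) integer tuples assigned to a single core."""
    if sum(Fraction(C, T) for C, _, T in tasks) > 1:
        return False
    H = lcm(*(T for _, _, T in tasks))  # hyperperiod
    # Only absolute deadlines need to be checked, since DBF is a step function.
    deadlines = sorted({D + k * T for _, D, T in tasks
                        for k in range(H // T) if D + k * T <= H})
    return all(sum(dbf(C, D, T, t) for C, D, T in tasks) <= t
               for t in deadlines)

# Parameters of the task set used in Section 5.
print(edf_schedulable([(2, 3, 4), (1, 8, 8)]))             # True
print(edf_schedulable([(2, 3, 4), (1, 8, 8), (3, 5, 6)]))  # False
print(edf_schedulable([(3, 5, 8), (3, 5, 6)]))             # False
```

Under these assumptions, the last two calls agree with the statement in Section 5 that τ _{1} can be added neither to the core hosting τ _{3} and τ _{4} nor to the core hosting τ _{2}.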
4 Proposed approach
We propose a semi-partitioned model of execution with work-stealing for fork-join tasks. The proposed approach consists of three phases referred to as (i) task assignment, (ii) offline scheduling, and (iii) online scheduling.
4.1 Task assignment phase
In [8], a variant of the first-fit decreasing (FFD) heuristic, hereafter referred to as FFDO^{7}, was selected. FFDO first divides the tasks into two classes: (1) tasks with λ _{ i }≤0.5 (light tasks) and (2) tasks with λ _{ i }>0.5 (heavy tasks)^{8}. The next step is to apply the classical FFD to light sequential tasks first and then to heavy sequential tasks. After this step completes, FFDO selects the light parallel tasks and then the heavy parallel tasks, again using FFD as the packing heuristic. Intuitively, by assigning sequential tasks first followed by the parallel tasks, the probability of having parallel tasks unallocated after the first phase increases.
All the tasks successfully assigned to the cores are referred to as non-migrating tasks, and the remaining tasks, i.e., those that cannot be assigned by the heuristic to any core without jeopardizing their schedulability, are referred to as candidate migrating tasks.
At the end of the assignment phase, if all tasks are assigned to cores, then there is no candidate migrating task and therefore no migrating task in the system. In this case, there is no need for parallelization and work-stealing as a fully partitioned assignment of the tasks to the cores has been found. Using work-stealing in this situation would just help in load-balancing the execution workload at the cost of allowing for unnecessary migrations. Due to this observation, work-stealing is forbidden for non-migrating parallel tasks. In the other case, if a task cannot be assigned to any core without jeopardizing its schedulability, then this task is deemed a candidate migrating task and is treated as a multiframe task. The system is deemed schedulable if and only if an execution pattern is found for each candidate migrating task such that all the timing requirements of the system are met.
The goal of this assignment behavior is to increase the possibility of benefiting from parallelism in the third phase of the approach as a way to reduce the response time of the tasks. For instance, some parallel tasks may not fit into the cores in this first phase, and if this is the case, such tasks can be rechecked in the second phase of the approach by treating them as multiframe tasks. If an execution pattern is found for the multiframe task, then these tasks can benefit from work-stealing in the third phase.
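The ordering applied by FFDO can be sketched as follows. This is only an illustration under stated assumptions: the `Task` record, the function names, and the task values are hypothetical; tasks within each class are sorted by decreasing density; and the per-core acceptance test `density_fits` (sum of densities at most 1) is a simple sufficient stand-in for the DBF-based test that the approach actually relies on.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    density: float
    parallel: bool

def ffdo(tasks, m, fits):
    """Return (per-core assignment, candidate migrating tasks)."""
    def group(parallel, heavy):
        sel = [t for t in tasks
               if t.parallel == parallel and (t.density > 0.5) == heavy]
        return sorted(sel, key=lambda t: t.density, reverse=True)

    # Light sequential, heavy sequential, light parallel, heavy parallel.
    order = (group(False, False) + group(False, True)
             + group(True, False) + group(True, True))
    cores = [[] for _ in range(m)]
    candidates = []
    for t in order:
        for core in cores:          # first fit: lowest-indexed core that accepts t
            if fits(core, t):
                core.append(t)
                break
        else:                       # no core accepted t
            candidates.append(t)
    return cores, candidates

def density_fits(core, t):
    """Assumed sufficient test: total density on the core stays within 1."""
    return sum(x.density for x in core) + t.density <= 1.0

tasks = [Task("a", 0.3, False), Task("b", 0.4, False), Task("c", 0.7, False),
         Task("d", 0.5, True), Task("e", 0.6, True)]
cores, candidates = ffdo(tasks, 2, density_fits)
print([[t.name for t in c] for c in cores], [t.name for t in candidates])
# [['b', 'a'], ['c']] ['d', 'e']
```

In this hypothetical run the sequential tasks fill both cores first, so the two parallel tasks remain as candidate migrating tasks, which is the behavior the heuristic is designed to favor.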
4.2 Offline scheduling phase
After the task assignment phase, let \(\tau ^{\pi _{j}}\) denote the set of tasks assigned to core π _{ j } (with 1≤j≤m). It follows that \(\tau ^{\pi _{j}} = \tau ^{\pi _{j}}_{\text {NM}} \cup \tau ^{\pi _{j}}_{\text {M}}\) where \(\tau ^{\pi _{j}}_{\text {NM}}\) denotes the subset of non-migrating tasks and \(\tau ^{\pi _{j}}_{\text {M}}\) denotes the subset of migrating tasks assigned to π _{ j }.
We remind the reader that each core runs an EDF scheduler, so the schedulability of the non-migrating tasks on each core is guaranteed as long as its load is less than 1. Concerning the migrating tasks, their jobs are distributed among the cores by following an execution pattern that does not jeopardize the schedulability of each individual core. To compute this pattern, the number of frames of each migrating task is computed as follows.
Definition 1
In Eq. 1, \(\text {lcm}_{\tau _{j} \in \tau } \{T_{j}\}\) denotes the least common multiple of the periods of all the tasks in τ. Goossens et al. [20] proved that this number of frames per migrating task is conservative and safe.
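The number of frames can be illustrated with a short sketch. The body of Eq. 1 is not reproduced in this text, so the form used here, k _{ i } equal to the least common multiple of all periods divided by T _{ i }, is an assumption inferred from the description above and from the example of Section 5, where k _{ i }=24/6=4. The function name `frames_per_task` is hypothetical.

```python
from math import lcm

def frames_per_task(periods):
    """Assumed form of Eq. 1: k_i = lcm of all periods divided by T_i."""
    H = lcm(*periods)
    return [H // T for T in periods]

# Periods of tau_1..tau_4 from Section 5: 6, 8, 4, 8 (hyperperiod 24).
print(frames_per_task([6, 8, 4, 8]))  # [4, 3, 6, 3]
```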
Definition 2
In Eq. 2, M[i,j] is a matrix of integers M[1…n,1…m] that tracks the current job-to-core assignment where M[i,j]=x means that x jobs of task τ _{ i } out of k _{ i } will execute on core π _{ j } (1≤i≤n and 1≤j≤m).
To the best of our knowledge, the uniform assignment given by Eq. 2 is the best result found in the literature for finding execution patterns for migrating tasks. An alternative approach is the generation of patterns via enumeration. Equation 2 is part of a set of algorithms that were proposed in [7] for finding patterns for multiframe tasks. The intuitive idea of these algorithms is to find the largest number of frames (jobs) that can be assigned to each core such that the migrating task is deemed schedulable. The result in [7] was integrated into our approach.
4.3 Online scheduling phase

R _{1}: At least one selected core must be idle when there are parallel subtasks awaiting execution.

R _{2}: Idle selected cores are allowed to steal subtasks from the deque of another selected core.

R _{3}: When stealing workload, the idle core must always steal the highest priority parallel subtask from the list of deques (as proposed in [4]) in order to avoid priority inversions (this situation occurs when the number of migrating tasks is greater than 1 and the tasks have different priorities).

R _{4}: After selecting a parallel subtask to steal, say from core A to core B, an admission test must be performed on core B to guarantee that its schedulability is not jeopardized by this additional workload.
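Rules R _{1} to R _{4} can be sketched as a single selection step. The data layout (one list of `(absolute_deadline, wcet, name)` tuples per selected core), the function name `select_steal`, the modeling of an idle core as one with an empty deque, and the `admit` callback standing in for the admission test of R _{4} are all assumptions made for this illustration; under EDF, the highest-priority subtask is taken to be the one with the earliest absolute deadline.

```python
def select_steal(deques, thief, admit):
    """Return (victim, subtask) if the thief may steal, else None.

    deques: dict mapping each selected core to its list of pending parallel
            subtasks, each a tuple (absolute_deadline, wcet, name).
    admit:  callable(thief, subtask) -> bool, the admission test of R4.
    """
    if deques.get(thief):            # R1/R2: only an idle selected core steals
        return None
    pending = [(sub, victim) for victim, dq in deques.items()
               if victim != thief for sub in dq]
    if not pending:
        return None
    # R3: always pick the highest-priority (earliest-deadline) subtask.
    sub, victim = min(pending, key=lambda p: p[0][0])
    if not admit(thief, sub):        # R4: reject if the thief would be overloaded
        return None
    deques[victim].remove(sub)
    return victim, sub

deques = {"A": [(11, 0.5, "x"), (5, 0.5, "y")], "B": []}
print(select_steal(deques, "B", lambda core, sub: True))
# ('A', (5, 0.5, 'y'))
```

If the admission test fails for the highest-priority subtask, this sketch performs no steal at all rather than trying a lower-priority one, which keeps R _{3} intact; whether the original framework retries is not stated in the text.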
We recall that we avoid the overhead of fetching the code of the task from the main memory as the code of the migrating task is already loaded on the selected cores after the execution of the first job in a selected core. Whenever a core performs a steal, data is fetched from the memory of another core, which is to a certain extent equivalent to a migration. However, only input data is fetched in this case. Moreover, the number of migrations is limited by the task-to-core mapping (performed offline), which forces a job to execute in the pre-assigned cores instead of migrating between an arbitrary number of cores as it would happen in a global approach.
5 Example of the approach
This section illustrates the proposed approach. We consider the task set τ={τ _{1},τ _{2},τ _{3},τ _{4}} with the following parameters (τ _{ i }={C _{ i },D _{ i },T _{ i }}): τ _{1}={3,5,6},τ _{2}={3,5,8},τ _{3}={2,3,4},τ _{4}={1,8,8}. We assume that all the tasks have a sequential behavior except τ _{1} for which the execution consists of three regions: (i) a sequential region of one time unit, then (ii) a parallel region of two subtasks of 0.5 time units each, and finally, (iii) a sequential region of one time unit. We assume that tasks in τ are released synchronously and scheduled on the homogeneous platform π={π _{1},π _{2}}. Finally, we assume that an EDF scheduler is running on each core.
Now, let us apply our proposed methodology to this task set. There is a single parallel task in the system:
(1) Task assignment phase: during this phase, τ _{3} and τ _{4} are assigned to π _{1} and τ _{2} is assigned to π _{2}. For the same reasons as in the previous case, task τ _{1} can neither be assigned to π _{1} nor to π _{2}, so it is considered as a candidate migrating task.
(2) Offline scheduling phase: during this phase, an execution pattern which does not jeopardize the schedulability of the cores for the migrating task τ _{1} is found. Task τ _{1} is then treated as a multiframe task on each core with the following characteristics as k _{ i }=24/6=4: \(\tau _{1}^{1} = ((3, 0, 0, 0), 5, 6)\) and \(\tau _{1}^{2} = ((0, 3, 3, 3), 5, 6)\). This is given with the interpretation that the first job of τ _{1} executes in core 1 and the remaining 3 jobs execute in core 2.
(3) Online scheduling phase: during this phase, task τ _{1} takes advantage of the work-stealing mechanism in order to reduce its average response time. Indeed, at time instant t=3, core π _{1} is executing the parallel region of task τ _{1} and core π _{2} is idle with sufficient resources, so it can steal one parallel subtask from the deque of π _{1}. The same situation occurs again at time t=7.5. Figure 2 (right side) illustrates the resulting schedule; the system is schedulable.
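The per-core multiframe view used in step (2) can be derived mechanically from a job-to-core pattern. In this small sketch, the function name `multiframe_views` and the 0-based core indices are assumptions; the pattern `[0, 1, 1, 1]` encodes that the first job of τ _{1} runs on π _{1} and the remaining three on π _{2}.

```python
def multiframe_views(C, job_to_core, m):
    """For each core, list the execution requirement of every frame (0 if absent)."""
    return [tuple(C if core == j else 0 for core in job_to_core)
            for j in range(m)]

print(multiframe_views(3, [0, 1, 1, 1], 2))
# [(3, 0, 0, 0), (0, 3, 3, 3)]
```

The two tuples match the frame lists of \(\tau _{1}^{1}\) and \(\tau _{1}^{2}\) given in step (2).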
6 Tasks with density greater than 1
In [8], we considered a model that only supports tasks with density no greater than one (λ _{ i }≤1). Nevertheless, it is possible to overcome this limitation by resorting to decomposition-based techniques. This section provides an example of task decomposition using the technique proposed in [9] and discusses the implications of combining such an approach with work-stealing.
Decomposition-based techniques ([9,10,12]) traditionally convert tasks with density greater than one into a set of constrained-deadline sequential subtasks, each with density no greater than one. These approaches try to avoid parallel structures by serializing parallel tasks as much as possible so that they can take advantage of schedulability techniques developed for sequential tasks.
In [9], the authors propose the task stretch transform algorithm, which uses the available slack^{10} of the task to proportionally stretch (i.e., serialize) parallel subtasks or parts of them in what is called a master string. The master string is assigned to a core and has an execution time length equal to D _{ i }=T _{ i }. The remaining parallel subtasks that cannot be combined in the master string are assigned intermediate releases and deadlines so that they become constrained-deadline tasks.
Some important aspects should be highlighted concerning the application of decomposition-based approaches, especially regarding work-stealing. Decomposition is useful as it allows one to know offline whether a certain task set is schedulable. If a task with density greater than one is identified as a migrating task, then it may be subject to stealing. If a subtask is stolen from such a task and its release offset and intermediate deadline are kept, then the task will not benefit from the stealing operation, as its response time will not decrease due to the precedence constraints imposed by the master string. However, the idle time thus freed can be used to execute lower-priority tasks or even to steal work from other cores. Another option is to handle the offset constraints carefully during runtime so that intermediate deadlines are guaranteed while ensuring that no deadlines are shifted.
7 Schedulability analysis
This section derives the schedulability analysis of a set of constrained-deadline fork-join tasks onto a homogeneous multicore platform. A modification of the semi-partitioned model is adopted (see Section 4), and we assume that each core runs an EDF scheduler, while allowing work-stealing among the “selected cores”, i.e., cores that share a copy of a migrating task. A schedulability analysis is performed in each phase of the proposed approach and works as follows.
(1) Task assignment phase: during this phase, the schedulability of the system is checked by applying the traditional DBF-based analysis [19] to non-migrating tasks, as explained in Section 3.
(2) Offline scheduling phase: during this phase, we make sure that the additional workload added to each core by the assignment of the migrating tasks does not jeopardize the schedulability of the core. Specifically, for each migrating task, say τ _{ i }, we use a modified DBF-based schedulability test as presented in [7]. In this test, the execution pattern of each migrating task τ _{ i } is taken into account. More precisely, the number of intervals of length (k _{ i }·T _{ i }) occurring in any interval of length t≥0 is computed as \(s \stackrel {\text {def}}{=} \Big \lfloor \frac {t}{k_{i} \cdot T_{i}} \Big \rfloor \). Since [0,t)=[0,s·k _{ i }·T _{ i })∪[s·k _{ i }·T _{ i },t), the number of frames that contribute to the additional workload on core π _{ j } consists of two terms: (i) the number of nonzero frames in the interval [0,s·k _{ i }·T _{ i }), denoted as \(s \cdot \ell _{i}^{j}\) (where \(\ell _{i}^{j}\) is the number of frames out of k _{ i } that were successfully assigned to π _{ j }), with corresponding workload \(s \cdot \ell _{i}^{j} \cdot C_{i}\); and (ii) an upper bound on the number of nonzero frames in the interval [s·k _{ i }·T _{ i },t), denoted as \(nb_{i}(t) = \Big \lfloor \frac {(t\ \text {mod}\ (k_{i} \cdot T_{i})) - D_{i}}{T_{i}} \Big \rfloor + 1\), with corresponding workload \( w_{i}^{j} = \text {max}_{c=0}^{k_{i}-1}\left (\sum _{\eta =c}^{c + nb_{i}(t)-1} C_{i, \eta \ \text {mod}\ k_{i}}\right)\). It follows that an upper bound on the total workload associated with task τ _{ i } on core π _{ j } is computed as \(\text {DBF}_{j}(\tau _{i},t) \stackrel {\text {def}}{=} s \cdot \ell _{i}^{j} \cdot C_{i} + w_{i}^{j}\). Consequently, \(\text {DBF}\left (\tau ^{\pi _{j}}_{\text {M}}, t \right) \stackrel {\text {def}}{=} \sum _{\tau _{i} \in \tau ^{\pi _{j}}_{\mathrm {M}}} \text {DBF}_{j}(\tau _{i},t)\).
In Eq. 3, \(\text {DBF}\left (\tau ^{\pi _{j}}_{\text {NM}}, t \right)\) represents the demand for the non-migrating tasks assigned to π _{ j } in the task assignment phase.
A work-stealing operation is feasible from one core, say core A, to another core, say core B, if core B can execute the stolen workload (i.e., a parallel subtask from the deque of core A) before the end of each stealing window (μ _{1} and μ _{2} in the example). Such time instants are denoted as the intermediate deadlines for the stolen subtask. To compute the intermediate deadline for each stealing window, we can take advantage of the slack available for each job. Thus, the intermediate deadline of the n ^{th} parallel segment can be computed as: \(d^{(n)}_{s} \stackrel {\text {def}}{=} \phi _{n} + m_{s} \cdot c_{s_{i}^{(n)}} + \text {slack}(\phi _{n})\). In this equation, ϕ _{ n } denotes the time instant at which the n ^{th} parallel segment spawns the subtasks, m _{ s } denotes the number of subtasks spawned in segment n, \(c_{s_{i}^{(n)}}\) denotes the worst-case execution time among the subtasks in segment n, and slack(ϕ _{ n }) represents the slack of the job at time ϕ _{ n }.
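The demand bound DBF _{ j }(τ _{ i },t) of a migrating task on one core can be sketched directly from the definitions in step (2). The function name `dbf_migrating` and the representation of the execution pattern as a tuple of per-frame requirements (0 for frames not assigned to the core) are assumptions; integer parameters are assumed so that floor division matches the floor operators of the formulas.

```python
def dbf_migrating(frames, C, D, T, t):
    """Upper bound on the demand of a migrating task on one core in [0, t).

    frames: per-frame execution requirement on this core (0 if the frame
            runs elsewhere); its length is k_i.
    """
    k = len(frames)
    s = t // (k * T)                          # complete pattern repetitions
    ell = sum(1 for f in frames if f > 0)     # nonzero frames per repetition
    nb = ((t % (k * T)) - D) // T + 1         # frames in the remaining window
    w = 0
    if nb > 0:
        # Worst case over every possible starting frame c of the window.
        w = max(sum(frames[eta % k] for eta in range(c, c + nb))
                for c in range(k))
    return s * ell * C + w

# Pattern of tau_1 on the second core in Section 5: ((0, 3, 3, 3), 5, 6).
for t in (5, 11, 24, 29):
    print(t, dbf_migrating((0, 3, 3, 3), C=3, D=5, T=6, t=t))
# 5 3
# 11 6
# 24 9
# 29 12
```

Because the second term takes the maximum over all starting frames, the bound is conservative: it charges the worst alignment of the pattern rather than the one produced by a synchronous release.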
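The intermediate deadline formula translates into a one-line computation. In the sketch below, the function names and the numeric values are hypothetical, the slack is supplied by the caller rather than derived, and the feasibility check deliberately ignores any higher-priority interference on the stealing core, so it is only a necessary condition in this simplified form.

```python
def intermediate_deadline(phi_n, m_s, c_max, slack):
    """d_s^(n) = phi_n + m_s * c_max + slack(phi_n)."""
    return phi_n + m_s * c_max + slack

def steal_fits_window(now, stolen_wcet, deadline):
    """Simplified check: the stolen subtask, run immediately, ends in time."""
    return now + stolen_wcet <= deadline

# Hypothetical values: segment spawned at time 1, two subtasks of at most
# 0.5 time units each, and 2 time units of slack at the spawn instant.
d = intermediate_deadline(phi_n=1, m_s=2, c_max=0.5, slack=2)
print(d)                                # 4.0
print(steal_fits_window(1, 0.5, d))     # True
print(steal_fits_window(3.8, 0.5, d))   # False
```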
8 Simulation results
This section presents the results of simulating our approach on a set of synthetic and randomly generated task sets. The simulation environment is described next.
Considered platform We consider a platform consisting of two or four homogeneous cores.
Task generation Each task τ _{ i } can be sequential or parallel. The number of each type of tasks depends on the generation itself and is not controlled beforehand. Tasks are created until the total utilization of the task set does not exceed the total platform capacity (i.e., U _{ τ }≤m).
Tasks are created by randomly selecting a number of segments k∈ [1,3,5,7]. When k=1, the task is sequential; otherwise, it is parallel. In case of a parallel task, the number of subtasks is n _{subtsk}∈ [k,10]. The worstcase execution time per subtask (C_{i,subtsk}) in each task varies in the range [1,max_Ci_subtsk] where max_Ci_subtsk=2 for performance reasons. We compute the worstcase execution time of each task as \(C_{i} = \sum _{\forall \ \text {subtsk} \in \tau _{i}} \mathrm {C}_{\text {i,subtsk}}\) ^{11}. Then, we derive the remaining parameters: the period T _{ i } and utilization U _{ i }. The period T _{ i } is uniformly generated in the interval [C _{ i },n _{subtsk}∗max_Ci_subtsk∗2]. This interval allows us to have a task utilization \(\left (\text {recall that}\ {U_{i} = \frac {C_{i}}{T_{i}}}\right)\) that falls in the interval [0.50,1] if all nodes are assigned max_Ci_subtsk or [0.25,1] if all nodes are assigned the minimum value for C_{i,subtsk} ^{12}. To generate execution patterns for the migrating tasks, we use Eq. 2 first and if no pattern is found we follow an enumeration approach. In our experiments, D _{ i }=T _{ i }. This procedure is repeated until 1000 task sets with migrating tasks are generated for two and four cores.
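The generation procedure can be sketched as follows. Several details are assumptions not stated in the text: a sequential task is given a single subtask, subtask execution times are drawn as integers, the period is drawn as a real number, the task that would push the total utilization above m is discarded, and the distribution of subtasks among segments is not modeled. All function and field names are hypothetical.

```python
import random

MAX_CI_SUBTSK = 2

def generate_task(rng):
    k = rng.choice([1, 3, 5, 7])                       # number of segments
    n_subtsk = 1 if k == 1 else rng.randint(k, 10)     # assumed 1 if sequential
    wcets = [rng.randint(1, MAX_CI_SUBTSK) for _ in range(n_subtsk)]
    C = sum(wcets)
    T = rng.uniform(C, n_subtsk * MAX_CI_SUBTSK * 2)   # period, with D = T
    return {"segments": k, "wcets": wcets, "C": C, "T": T, "D": T, "U": C / T}

def generate_task_set(m, rng):
    """Add tasks while the total utilization stays within the m cores."""
    tasks, total = [], 0.0
    while True:
        task = generate_task(rng)
        if total + task["U"] > m:
            return tasks
        tasks.append(task)
        total += task["U"]

rng = random.Random(0)   # fixed seed only to make the sketch repeatable
task_set = generate_task_set(2, rng)
print(len(task_set), sum(t["U"] for t in task_set))
```

Since every generated utilization is at least 0.25, the loop in `generate_task_set` always terminates.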
Selected heuristics In order to evaluate the performance of FFDO, we have conducted benchmarks against other wellknown binpacking heuristics, namely the standard firstfit decreasing (FFD), bestfit decreasing (BFD), and worstfit decreasing (WFD). FFD assigns each task to the first core from the set of cores with sufficient idle time to accommodate it; BFD assigns each task into the core which after the assignment minimizes the idle time among all cores; and WFD assigns the task to the core which after the assignment maximizes the idle time among all cores. All the heuristics, except FFDO, group the tasks into sequential and parallel tasks and sort each group in a decreasing order of task utilization.
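The difference between the three reference heuristics lies only in how a core is picked among those that can accommodate the task. The sketch below makes this explicit; the function name `choose_core`, the use of a plain utilization bound as the acceptance test, and the sample loads are assumptions made for illustration, and the sorting of tasks in decreasing order is left to the caller.

```python
def choose_core(loads, u, policy, capacity=1.0):
    """Return the index of the core chosen for a task of utilization u, or None."""
    feasible = [j for j, load in enumerate(loads) if load + u <= capacity]
    if not feasible:
        return None
    if policy == "FFD":      # first core with enough idle time
        return feasible[0]
    if policy == "BFD":      # least idle time left after the assignment
        return min(feasible, key=lambda j: capacity - (loads[j] + u))
    if policy == "WFD":      # most idle time left after the assignment
        return max(feasible, key=lambda j: capacity - (loads[j] + u))
    raise ValueError(f"unknown policy: {policy}")

loads = [0.5, 0.2, 0.6]
print([choose_core(loads, 0.3, p) for p in ("FFD", "BFD", "WFD")])  # [0, 2, 1]
```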
The task sets schedulable by using WFD can be divided into four groups: 26.85% of these task sets are schedulable by using both heuristics; 24.51% are not schedulable by using FFDO due to k _{ i } ^{13}; 43.19% are not schedulable by using FFDO with a k _{ i } value in the range of valid values; and finally, 5.45% of the task sets are deemed not schedulable with FDDO after applying the heuristic. Overall, in a twocore setting, the total number of task sets that are schedulable by using WFD is 257, which represents an increase of 157% over FFDO for the same input. From the diagram, the majority of the task sets that are schedulable by using WFD fit in a potential feasible region for FFDO heuristic (43.19%) — here, all task sets have migrating tasks and k _{ i } values that fit in the range of valid values but no feasible pattern is found. These results still hold for four cores but to a less extent as only 17.9% more task sets were schedulable by using WFD over FFDO.
We conjecture that WFD behaves better than FFDO (even though FFDO has a higher percentage of unassigned tasks as shown in Fig. 8) for smaller number of cores because of the tasktocore assignment. Depending on the granularity of the utilization of the task sets, more empty space may be available globally in the cores when performing the task allocation for a small number of cores. These idle slots make it possible for our patternfinding procedure to find enough room to fit a job of a task when computing the execution pattern for a migrating task. However, as the number of cores increases, WFD naturally balances the workload through the cores, whereas FFDO assigns the workload in the initial cores leaving more room in later cores. For this reason, we envision that WFD will have the tendency to behave either equally to or even worse than FFDO with the increase in the number of cores.
Considered metrics In order to evaluate the proposed approach, we measure the gain obtained in terms of the average worst-case response time for each schedulable task set. Specifically, for each task set, we generate the complete schedule for the two approaches: the approach that schedules migrating tasks without applying the work-stealing mechanism among the selected cores, denoted as ApproachNS; and the approach that applies the work-stealing mechanism among the selected cores, denoted as ApproachS. After generating both schedules for each task set, we compute the average response time of the jobs of each task throughout the hyperperiod by adding the response time of each individual job and dividing the obtained result by the number of jobs in one hyperperiod. This process is applied to both approaches. The improvement, i.e., the gain of ApproachS over ApproachNS, is computed by applying the following formula for each task τ _{ i }: \(AV_{\tau _{i}} = \frac {AV^{NS}_{\tau _{i}} - AV^{S}_{\tau _{i}}}{AV^{NS}_{\tau _{i}}} \cdot 100\), where \(AV^{NS}_{\tau _{i}}\) denotes the average response time for task τ _{ i } in ApproachNS and \(AV^{S}_{\tau _{i}}\) denotes its average response time in ApproachS. The average gain per task in the task set, AV _{ τ }, is then obtained by averaging the per-task gains over all tasks in τ: \(AV_{\tau } = \frac {1}{|\tau |} \cdot \sum _{\tau _{i} \in \tau } AV_{\tau _{i}}\).
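The metric can be illustrated with a short sketch. It assumes that the response times of all jobs of a task over one hyperperiod are already available as lists, one list per approach; the function names and the numbers in the usage note are hypothetical and are not taken from the experiments.

```python
from typing import List


def average_response_time(job_response_times: List[float]) -> float:
    """Average response time of the jobs of one task over one hyperperiod."""
    return sum(job_response_times) / len(job_response_times)


def gain_per_task(av_ns: float, av_s: float) -> float:
    """Gain (in %) of ApproachS over ApproachNS for one task.

    av_ns: average response time without work-stealing (ApproachNS)
    av_s:  average response time with work-stealing (ApproachS)
    """
    return (av_ns - av_s) / av_ns * 100.0


def average_gain(per_task_gains: List[float]) -> float:
    """Average gain per task over all tasks of the task set."""
    return sum(per_task_gains) / len(per_task_gains)
```

For example, a task whose three jobs respond in 10, 12 and 14 time units under ApproachNS and in 9 time units each under ApproachS has average response times of 12 and 9, so `gain_per_task(12.0, 9.0)` evaluates to 25.0.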
Interpretation of the results The improvement in terms of average response time per task (in %) when using ApproachS over ApproachNS is grouped by utilization (see Fig. 10). For each subfigure, the distribution of the data is depicted as a box plot. In the plot, for each utilization value, it is possible to see the minimum and maximum values of gain per task, the median and the mean (in the form of a diamond shape), the first and third quartiles, and finally, the outliers in the shape of a cross. The line in red depicts a linear regression on the data (the mean value was used to compute the regression) in order to show the predicted trend of the gain per task.
Considering two cores: for task sets with a high utilization (over 1.55), the gain of the proposed approach is clearly visible. In the best case, this gain reaches nearly 15% of the average response time per task for FFDO and nearly 12% for WFD, which is non-negligible. As the utilization of the task sets increases, the gain per task decreases. This is expected due to the increasing lack of idle time available for stealing. The trend shows that above a utilization of 1.95, the work-stealing mechanism becomes of little interest. This is explained by the fact that the total workload on each core is very high, thus leaving very little room for improvement on the average response time of each migrating task through work-stealing. It is important to note that task sets with utilizations below 1.55 using FFDO and 1.45 using WFD are not included in the plot as they do not contain any migrating task.
Considering four cores: the trend is similar to the one depicted for two cores. This trend is also shown by the linear regression line, which predicts the average gain per task as a function of the utilization of the task set. The regression shows that for lower utilizations in two cores, the expected improvement starts at 2.3% for FFDO and 3.3% for WFD. For four cores, it starts at 1.4% for both heuristics. We can also observe that the expected improvement decreases with an increase in the tasks' utilization. This behavior suggests that work-stealing is useful for task sets with migrating tasks whose utilization spans from the lowest possible utilization for such task sets up to the platform capacity. Closer to this upper limit, the benefits of using work-stealing are limited. From the observed behavior in two and four cores, we conjecture that the proposed approach will behave similarly when the number of cores increases.
Overheads of the approach This work shows that it is possible to decrease the average response time of tasks and use the newly generated free time slots to execute less critical tasks (e.g., aperiodic or best-effort tasks). While such a decrease involves overhead costs, such as the number and cost of migrations or even the impact of online admission control on the overall approach, we did not explicitly measure them. Still, we provide an overview of the existing costs and their possible impact on system performance.
We assume that cores that share a migrating task have a local copy of this task. However, keeping task copies is platform dependent, as on some platforms it might not be possible to keep copies due to memory constraints. In our approach, local copies are used for migrating tasks which might be subject to stealing, and having a local copy avoids fetching the task code from the main memory. Whenever a stealing operation occurs, a core fetches data from another core's memory in order to help in the execution of the task. While this is not a task migration per se, it has some commonalities, as data needs to be moved from one core to another. This may cause interference in the execution of other tasks in the system (for instance, due to the existence of shared resources). In our approach, this overhead only arises when stealing occurs, and stealing is performed by a core that is idle, so part of the cost is borne by the idle core (which is negligible due to the idleness of the core). The number of data transfers can be bounded in our framework, as in the worst case the number of data fetches when stealing depends on the number of subtasks in each segment and the number of cores that share the task.
Considering the online admission control, our test requires the current time instant and the available slack at a specific time instant. Both of these variables can be easily computed on any given platform by using the platform timing functions and a cumulative function that computes the slack for the current job. Therefore, we consider that this does not pose any significant overhead in our approach.
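To give an idea of how lightweight this bookkeeping can be, the following sketch tracks the slack of a single job cumulatively. It is a minimal illustration that assumes the slack at time t is the absolute deadline minus t minus the remaining execution time of the job, following the definition of slack given in the endnotes; it ignores interference from other jobs on the same core and is not the admission test itself. The class name `JobSlack` and its fields are hypothetical, and the current time would come from the platform timing functions.

```python
class JobSlack:
    """Cumulative slack bookkeeping for one job (illustrative sketch)."""

    def __init__(self, arrival: float, deadline: float, wcet: float) -> None:
        self.arrival = arrival      # arrival time a_{i,j}
        self.deadline = deadline    # absolute deadline d_{i,j}
        self.wcet = wcet            # execution requirement of the job
        self.executed = 0.0         # execution time consumed so far

    def account(self, amount: float) -> None:
        """Record that the job has executed for `amount` time units."""
        self.executed += amount

    def slack(self, now: float) -> float:
        """Time by which the remaining work can still be delayed at `now`."""
        if not (self.arrival <= now <= self.deadline):
            raise ValueError("slack is only defined between arrival and deadline")
        remaining = self.wcet - self.executed
        return self.deadline - now - remaining
```

For a job with arrival 0, absolute deadline 10 and execution requirement 4 that has already executed for 1 time unit, `slack(3.0)` returns 4.0: the remaining 3 units of work may start as late as time 7.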
9 Conclusions
In this paper, we combined techniques that allow us to schedule fine-grained parallel fork-join tasks onto multicore platforms. By using the proposed technique, we can schedule systems with high utilizations. Moreover, the proposed technique takes advantage of the semi-partitioned scheduling properties by offering the possibility to accommodate parallel tasks that cannot be scheduled in any pure partitioned environment, while reducing the migration overhead, which has been shown to be a traditional major source of nondeterminism in global approaches. Parallel tasks are heavy by nature and are therefore natural candidates for this model when execution time constraints are present. Our results show that by using work-stealing, it is possible to achieve an average gain on the response times of the parallel tasks between 0 and nearly 15% per task, which may leave extra idle time in the schedule to execute less critical tasks in the platform (e.g., aperiodic or best-effort tasks).
10 Endnotes
^{1} Note that the balance of the platform workload at runtime also allows for a better control of the platform energy consumption [21,22].
^{2} A deque is a special type of queue which also works as a stack.
^{3} Two or more cores executing a migrating task share a copy of this task.
^{4} There is no restriction on the execution requirement of each node, and the execution time of each node may vary from one node to another.
^{5} There is no communication, no precedence constraints and no shared resources (except for the cores) between subtasks.
^{6} A task which consists of a single subtask in each of its segments is considered a sequential task.
^{7} We have explored alternative heuristics (see Section 8).
^{8} The threshold for classifying tasks varies in the literature; nevertheless, a density of 0.5 is usually regarded as a good threshold.
^{9} These cores are also referred to as “selected cores”.
^{10} Slack is the maximum amount of time that the remaining computation time of a job can be delayed at a time instant t (with a _{ i,j }≤t≤d _{ i,j }) in order to complete within its deadline.
^{11} By considering the worst-case execution time for each subtask in the experiments, we are evaluating the benefits of using work-stealing in the worst possible scenario.
^{12} As we evaluate the behavior of each task set in the interval [0,H], where H denotes the least common multiple of the periods of all the tasks in the task set, and as T _{ i } in our generation depends on C _{ i }, the higher the C _{ i }, the higher the T _{ i } and consequently, the higher the hyperperiod of the task set. By limiting C_{i,subtsk} we are also limiting the amount of time we need to generate the schedule.
^{13} As explained in [8], we reject task sets that have a number of frames over 10 for performance considerations. In summary, the complexity of the computation of the migrating patterns increases for large k _{ i }, which leads to higher computation times.
11 Appendix
Notation table
Symbols  Description 

τ  Set of n tasks 
D _{ i }  Relative deadline of task τ _{ i } 
T _{ i }  Period of task τ _{ i } 
a _{ i,j }  Arrival time of job j of task τ _{ i } 
d _{ i,j }  Absolute deadline of job j of task τ _{ i } 
\(S_{i}=\left [s_{i}^{1}, s_{i}^{2}, \ldots, s_{i}^{n_{i}}\right ]\)  Sequence of n _{ i } segments, \(n_{i} \in \mathbb {N}\) 
\(s_{i}^{k}\)  Segment k∈S _{ i } 
\(t_{s_{i}^{k}}^{q}\)  Subtask q belonging to segment \(s_{i}^{k}\) 
v _{ k }  Number of subtasks belonging to segment \(s_{i}^{k}\) 
\(e_{s_{i}^{k}}^{q}\)  Execution time of subtask \(t_{s_{i}^{k}}^{q}\) 
C _{ i }  Total execution requirement of task τ _{ i } 
P _{ i }  Minimum execution requirement of task τ _{ i } 
\(c_{s_{i}^{k}}\)  Worst-case execution time among the subtasks of segment \(s_{i}^{k}\) 
U _{ i }  Utilization factor of task τ _{ i } 
λ _{ i }  Density factor of task τ _{ i } 
U _{ τ }  Total utilization factor of the task set τ 
λ _{ τ }  Total density of the task set τ 
π  Set of m homogeneous cores 
\(\tau ^{\pi _{j}}\)  Set of tasks assigned to core π _{ j } 
\(\tau _{\text {NM}}^{\pi _{j}}\)  Subset of nonmigrating tasks assigned to π _{ j } 
\(\tau _{\mathrm {M}}^{\pi _{j}}\)  Subset of migrating tasks assigned to π _{ j } 
k _{ i }  Number of frames of a migrating task τ _{ i } 
H  Least common multiple of the periods of all the tasks in τ 
σ  Job-to-core assignment sequence 
M[i,j]  Matrix of integers of the current job-to-core assignment 
s  Number of intervals of length k _{ i }·T _{ i } 
\(\ell _{i}^{j}\)  Number of frames out of k _{ i } that were successfully assigned to π _{ j } 
nb _{ i }(t)  Upper bound on the number of nonzero frames in the interval [s·k _{ i }·T _{ i },t] 
\(w_{i}^{j}\)  Workload in the interval [s·k _{ i }·T _{ i },t] 
dbf _{ j }(τ _{ i },t)  Demand-bound function for task τ _{ i } in π _{ j } in the interval [0,t] 
ϕ _{ n }  Fork point for segment n 
μ _{ n }  Synchronization point for segment n 
ω _{ n }  Stealing window n 
slack (ϕ _{ n })  Slack of the job at time instant ϕ _{ n } 
\(d_{s}^{(n)}\)  Intermediate deadline of the n ^{th} parallel segment 
m _{ s }  Number of subtasks spawned in segment n 
\(c_{s_{i}^{(n)}}\)  Worst-case execution time among the tasks that belong to segment n 
\(AV_{\tau _{i}}\)  Gain per task of ApproachS over ApproachNS 
AV _{ τ }  Average gain per task in the task set 
Declarations
Acknowledgements
This work was partially supported by National Funds through FCT/MEC (Portuguese Foundation for Science and Technology) and co-financed by ERDF (European Regional Development Fund) under the PT2020 Partnership, within project UID/CEC/04234/2013 (CISTER Research Centre); by FCT/MEC and ERDF through COMPETE (Operational Programme ‘Thematic Factors of Competitiveness’), within project(s) FCOMP-01-0124-FEDER-020447 (REGAIN); by FCT/MEC and the EU ARTEMIS JU within project(s) ARTEMIS/0001/2013 - JU grant nr. 621429 (EMC2); by the European Union under the Seventh Framework Programme (FP7/2007-2013), grant agreement no 611016 (P-SOCRATES); and also by FCT/MEC and the ESF (European Social Fund) through POPH (Portuguese Human Potential Operational Program), under PhD grant SFRH/BD/88834/2012.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
[1] RI Davis, A Burns, A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43(4), 35:1–35:44 (2011).
[2] A Marowka, Parallel computing on any desktop. Commun. ACM. 50, 74–78 (2007).
[3] RD Blumofe, CE Leiserson, Scheduling multithreaded computations by work stealing. J. ACM. 46(5), 720–748 (1999).
[4] C Maia, L Nogueira, LM Pinho, in Industrial Embedded Systems (SIES), 2013 8th IEEE International Symposium On. Scheduling parallel real-time tasks using a fixed-priority work-stealing algorithm on multiprocessors, (2013), pp. 89–92. doi:10.1109/SIES.2013.6601477.
[5] JH Anderson, V Bud, UC Devi, in Proceedings of the 17th Euromicro Conference on Real-Time Systems, ECRTS ’05. An EDF-based scheduling algorithm for multiprocessor soft real-time systems (IEEE Computer Society, Washington, 2005), pp. 199–208.
[6] S Kato, N Yamasaki, Y Ishikawa, in Real-Time Systems, 2009. ECRTS ’09. 21st Euromicro Conference On. Semi-partitioned scheduling of sporadic task systems on multiprocessors, (2009), pp. 249–258.
[7] F Dorin, PM Yomsi, J Goossens, P Richard, Semi-partitioned hard real-time scheduling with restricted migrations upon identical multiprocessor platforms. CoRR abs/1006.2637 (2010).
[8] C Maia, PM Yomsi, L Nogueira, LM Pinho, in Embedded and Ubiquitous Computing (EUC), 2015 IEEE 13th International Conference On. Semi-partitioned scheduling of fork-join tasks using work-stealing, (2015), pp. 25–34. doi:10.1109/EUC.2015.30.
[9] K Lakshmanan, S Kato, RR Rajkumar, in Proceedings of the 2010 31st IEEE Real-Time Systems Symposium, RTSS ’10. Scheduling Parallel Real-Time Tasks on Multi-core Processors (IEEE Computer Society, Washington, 2010), pp. 259–268. http://dx.doi.org/10.1109/RTSS.2010.42.
[10] A Saifullah, K Agrawal, C Lu, C Gill, in Proceedings of the 2011 IEEE 32nd Real-Time Systems Symposium, RTSS ’11. Multi-core Real-Time Scheduling for Generalized Parallel Task Models (IEEE Computer Society, Washington, 2011), pp. 217–226. http://dx.doi.org/10.1109/RTSS.2011.27.
[11] V Bonifaci, A Marchetti-Spaccamela, S Stiller, A Wiese, in Real-Time Systems (ECRTS), 2013 25th Euromicro Conference On. Feasibility analysis in the sporadic DAG task model, (2013), pp. 225–233. doi:10.1109/ECRTS.2013.32.
[12] M Qamhieh, L George, S Midonnet, in Proceedings of the 22nd International Conference on Real-Time Networks and Systems, RTNS ’14. A stretching algorithm for parallel real-time DAG tasks on multiprocessor systems (ACM, New York, 2014), pp. 13–22. doi:10.1145/2659787.2659818. http://doi.acm.org/10.1145/2659787.2659818.
[13] J Li, K Agrawal, C Lu, C Gill, in Real-Time Systems (ECRTS), 2013 25th Euromicro Conference On. Analysis of global EDF for parallel tasks, (2013), pp. 3–13. doi:10.1109/ECRTS.2013.12.
[14] HS Chwa, J Lee, KM Phan, A Easwaran, I Shin, in Real-Time Systems (ECRTS), 2013 25th Euromicro Conference On. Global EDF schedulability analysis for synchronous parallel tasks on multicore platforms, (2013), pp. 25–34. doi:10.1109/ECRTS.2013.14.
[15] C Maia, M Bertogna, L Nogueira, LM Pinho, in Proceedings of the 22nd International Conference on Real-Time Networks and Systems, RTNS ’14. Response-time analysis of synchronous parallel tasks in multiprocessor systems (ACM, New York, 2014), pp. 3–12. doi:10.1145/2659787.2659815. http://doi.acm.org/10.1145/2659787.2659815.
[16] B Bado, L George, P Courbin, J Goossens, in Proceedings of the 20th International Conference on Real-Time and Network Systems, RTNS. A semi-partitioned approach for parallel real-time scheduling (ACM, New York, 2012), pp. 151–160. doi:10.1145/2392987.2393006. http://doi.acm.org/10.1145/2392987.2393006.
[17] AK Mok, D Chen, A multiframe model for real-time tasks. Softw. Eng. IEEE Trans. 23(10), 635–645 (1997).
[18] S Baruah, D Chen, S Gorinsky, A Mok, Generalized multiframe tasks. Real-Time Syst. 17(1), 5–22 (1999).
[19] SK Baruah, AK Mok, LE Rosier, in Real-Time Systems Symposium, 1990. Proceedings., 11th. Preemptively scheduling hard-real-time sporadic tasks on one processor, (1990), pp. 182–190.
[20] J Goossens, P Richard, M Lindström, II Lupu, F Ridouard, in Proceedings of the 20th International Conference on Real-Time and Network Systems, RTNS ’12. Job partitioning strategies for multiprocessor scheduling of real-time periodic tasks with restricted migrations (ACM, New York, 2012), pp. 141–150.
[21] J Kang, DG Waddington, in Proceedings of the 2012 IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA ’12. Load Balancing Aware Real-Time Task Partitioning in Multicore Systems (IEEE Computer Society, Washington, 2012), pp. 404–407. http://dx.doi.org/10.1109/RTCSA.2012.71.
[22] H Aydin, Q Yang, in Parallel and Distributed Processing Symposium, 2003. Proceedings. International. Energy-aware partitioning for multiprocessor real-time systems, (2003), p. 9. doi:10.1109/IPDPS.2003.1213225.