The applications are periodic: in one period, all the tasks of the DFG must be executed. In image processing, for instance, the period is the execution time needed to process one image. Scheduling occurs online, at the end of the execution of all the tasks and whenever a violation of the real-time constraints is predicted. The result of partitioning/scheduling is therefore applied to the next period (the next image, for image processing applications).
Our run-time scheduling policy is dynamic since the execution order of the application tasks is decided at run time. For the tasks implemented on the RCU, we assume that the hardware resources are sufficient to execute in parallel all the hardware tasks chosen by the partitioning step. The only condition for launching their execution is therefore the satisfaction of all data dependencies: a task may begin its execution only after all its predecessor tasks have completed.
For the tasks implemented on the software processors, the conditions for launching are the following:
(1) the satisfaction of all data dependencies;
(2) the availability (release) of the software processing unit.
Hence a task can be in one of four states:
(i) Waiting,
(ii) Running,
(iii) Ready,
(iv) Stopped.
A task is in the Waiting state while it waits for the end of execution of one or several predecessor tasks. When a software processing unit has finished executing a task, new tasks may become Ready for execution, provided all their dependencies have been completed.
A task enters the Stopped state when it is preempted or when it finishes its execution.
The processing units (SW, SL, and HW) of our target architecture can each be in one of three states: execution, reconfiguration, or idle.
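To make these definitions concrete, a minimal C sketch of how the task and unit states could be encoded in a simple executive is given below; the type and constant names are ours, not those of the scheduler IP.

typedef enum {
    TASK_WAITING,   /* waiting for one or several predecessor tasks        */
    TASK_RUNNING,   /* currently executing on its target unit              */
    TASK_READY,     /* all dependencies done, but the target unit is busy  */
    TASK_STOPPED    /* preempted or finished                               */
} TaskState;

typedef enum {
    UNIT_EXECUTION,        /* the unit is executing a task                 */
    UNIT_RECONFIGURATION,  /* only meaningful for the RCU                  */
    UNIT_IDLE              /* the unit is free                             */
} UnitState;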
In the following, we will explain the principle of our approach as well as a hardware implementation of the proposed HW/SW scheduler.
4.1. Description of the Scheduling Algorithm
As explained in Algorithm 1, the basic idea of our scheduling heuristic is to decide task priorities according to three criteria.
Algorithm 1: Principle of our scheduling policy.
For all software tasks do
    Compute ASAP
The task with the minimum ASAP is chosen
If (equality of ASAP)
    Compute urgency
    The task with the maximum urgency is chosen
    If (equality of urgency)
        Compare execution times
        The task with the maximum execution time is chosen
The first criterion is the As Soon As Possible (ASAP) time: the task with the earliest ASAP time is launched first.
The second criterion is the urgency time: the task with the maximum urgency has priority over the others. This new criterion is based on the nature of the successors of the task and is used only when at least two tasks are tied on the first criterion. If there is still a tie on this second criterion, we compare the last criterion, the execution time of the tasks, and the task with the longest execution time is launched first.
These criteria are used to choose which of two or more software tasks (on the Master or on the Slave) to run, as illustrated in the sketch below.
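The following C sketch illustrates this tie-breaking order among the ready software tasks. It is only an illustration of Algorithm 1: the Task structure and its field names are assumptions made for the example, not part of the hardware scheduler.

#include <stddef.h>

typedef struct {
    int asap;      /* ASAP (earliest start) time                   */
    int urgency;   /* urgency time, 0 if the task is not urgent    */
    int texe;      /* measured execution time                      */
    int ready;     /* 1 if all data dependencies are satisfied     */
} Task;

/* Returns the index of the software task to launch next, or -1 if none
 * is ready. Priority: minimum ASAP, then maximum urgency, then maximum
 * execution time, following Algorithm 1. */
int pick_next_sw_task(const Task *t, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!t[i].ready)
            continue;
        if (best < 0 ||
            t[i].asap < t[best].asap ||
            (t[i].asap == t[best].asap &&
             (t[i].urgency > t[best].urgency ||
              (t[i].urgency == t[best].urgency &&
               t[i].texe > t[best].texe))))
            best = (int)i;
    }
    return best;
}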
4.1.1. The Urgency Criterion
The urgency criterion is based on the implementation of a task and the implementations of its successors. A task is considered urgent when it is implemented on a software unit (Master or Slave) and has one or more successor tasks implemented on a different unit (the hardware unit or the other software unit).
Figure 3 shows three examples of DFG. In Figure 3(a), task C is implemented on the Slave processor and is followed by task D, which is implemented on the RCU. Thus the urgency (Urg) of task C is the execution time of its successor (Urg(C) = 13). In example (b) it is task B which is followed by a task D implemented on a different unit (the Master processor). In the last example (c), both tasks B and C are urgent, but task B is more urgent than task C since its successor has a longer execution time than the successor of task C.
When a task has several successors with different implementations, its urgency is the maximum of the execution times of those successors.
In the general case, when the direct successor of a task A has the same implementation as A but itself has a successor with a different implementation, this latter task feeds its urgency back to task A, as the sketch below illustrates.
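A possible software rendering of this rule is the following: the urgency of a task is the maximum execution time among its successors mapped to a different unit, and a successor with the same mapping feeds its own urgency back. The graph encoding (adjacency matrix, implementation codes) is an assumption made for this sketch.

#define MAX_TASKS 64

int impl[MAX_TASKS];            /* 0 = Master, 1 = Slave, 2 = RCU (assumed codes) */
int texe[MAX_TASKS];            /* execution time of each task                    */
int succ[MAX_TASKS][MAX_TASKS]; /* succ[i][j] = 1 if task j is a successor of i   */
int ntasks;

/* Urgency of task i: the maximum execution time among successors mapped
 * to a different unit; a successor with the same mapping propagates its
 * own urgency back (the general case described above). */
int urgency(int i)
{
    int urg = 0;
    for (int j = 0; j < ntasks; j++) {
        if (!succ[i][j])
            continue;
        int u = (impl[j] != impl[i]) ? texe[j] : urgency(j);
        if (u > urg)
            urg = u;
    }
    return urg;
}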
Figure 3(d) shows the scheduling result for case (a) when the urgency criterion is respected, and Figure 3(e) shows the result when it is not. For all the DFG examples of Figure 3, the urgency criterion leads to the choice that minimizes the total execution time. The third criterion (the execution time) is an arbitrary choice and very rarely has an impact on the total execution time.
We can also notice that our scheduler supports the dynamic creation and deletion of tasks. These online services are only possible when the structure of the DFG is kept fixed along the execution; in that case the dependencies between tasks are known a priori. Dynamic deletion is then achieved by assigning a null execution time to the tasks that are not active, and dynamic creation by restoring their execution time when they become active.
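Under this fixed-structure assumption, activation and deactivation reduce to updating the stored execution time of a task, for instance with two hypothetical helpers such as the following.

/* Dynamic deletion: a task with a null execution time no longer
 * contributes to the schedule, which emulates its removal. */
void deactivate_task(int texe[], int id)
{
    texe[id] = 0;
}

/* Dynamic creation: restore the measured execution time when the
 * task becomes active again. */
void activate_task(int texe[], int id, int measured_texe)
{
    texe[id] = measured_texe;
}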
This scheduling strategy requires an online computation of several criteria for all the software tasks in the DFG.
We first tried to implement this new scheduling policy on a processor. Figure 4 shows the computation time of our scheduling method when implemented on an Intel Core 2 Duo CPU running at 2.8 GHz with 4 GB of RAM. The average computation time of the scheduler is about 12 milliseconds per image. These experiments were done on an image processing application (the DFG depicted in Figure 12) whose processing period per image is 19 milliseconds. The scheduling (with this software implementation) therefore takes about 63% of the processing time of one image on a desktop computer.
We can conclude that, in an embedded context, a software implementation of this strategy is thus incompatible with real-time constraints.
We describe in the following an optimized hardware implementation of our scheduler.
4.2. Hardware Scheduler Architecture
In this section, we describe the proposed architecture of our scheduler. This architecture is shown in Figure 5 for a DFG example of three tasks. It is divided into four main parts:
(1) the DFG_IP_Sched (the middle part surrounded by a dashed line in the figure);
(2) the DFG_Update (DFG_Up in the figure);
(3) the MS_Manager (SWTM);
(4) the Slave_Manager (SLTM).
The basic idea of this hardware architecture is to parallelize the scheduling of the processing tasks as much as possible: in the best case, with an architecture offering unlimited resources, all the tasks of the DFG can be scheduled in parallel.
We associate with the application DFG a modified graph of the same structure composed of IP nodes (each IP represents a task). Therefore, in the best case, where the tasks are independent, all the tasks in the DFG could be scheduled in a single clock cycle.
To also parallelize the management of the software execution times, we associate with each software unit a hardware module:
(i) the Master Task Manager (SWTM in Figure 5),
(ii) the Slave Task Manager (SLTM in Figure 5).
These two modules manage the order of task execution and compute the processor execution time of each task.
The input signals of this scheduler architecture are the following:
(i) a pointer in memory to the implementations of all the tasks; there are three kinds of implementation (RCU, Master, and Slave), which are encoded with the signals SW and HW;
(ii) the measured execution time of each task (Texe);
(iii) the Clock signal and the Reset.
The output signals are the following:
(i) the total execution time after scheduling all the tasks (Texe_Total);
(ii) the All_Done signal, which indicates the end of the scheduling;
(iii) Scheduled_DFG, a pointer to the scheduling result matrix to be sent to the operating system (or any simple executive);
(iv) Nb_Task and Nb_Task_Slave, the number of tasks scheduled on the Master and on the Slave, respectively. These two signals were added solely for simulation purposes in ModelSim (to check the scheduling result); in the real case they are not needed since this information comes from the partitioning block.
The last block is the DFG_Up, which updates the result matrix after each task is scheduled.
In the following paragraphs, we will detail each part of this architecture.
4.2.1. The DFG_IP_Sched Block
This block contains N components, where N is the number of tasks in the application. For each task we associate an IP component which computes the intrinsic characteristics of this task (urgency, ASAP, Ready state, etc.). It also computes the total execution time for the entire graph.
The proposed architecture of this IP is shown in Figure 6 (in the appendix).
For each task, the implementation (PE) and the execution time are fixed, so the role of this IP is to compute the start time of the task and to define its state. This is done by taking into account the state of the corresponding target (Master, Slave, or RCU). The IP then iterates along the DFG structure to determine a total execution order and to assign the start times.
This IP also computes the urgency criterion of the critical tasks, according to the implementation and the execution time of their successors.
If a task is implemented on the RCU, it is launched as soon as all its predecessors are done. The scheduling time of the hardware tasks therefore depends on the number of tasks that can run in parallel: the IP can schedule all the hardware tasks that run in parallel within a single clock cycle.
For the software tasks (on the Master or on the Slave), the scheduling takes one clock cycle per task. Thus the computation time of the hardware scheduler only depends on the result of the HW/SW partitioning.
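As a software analogue of this computation, a task's start time can be expressed as the latest of its predecessors' finish times and, for a software task, the time at which its processor becomes free. The data layout below is an assumption made for the example, not the RTL of the IP.

#define MAX_TASKS 64

int pred[MAX_TASKS][MAX_TASKS]; /* pred[i][j] = 1 if task j precedes task i */
int finish[MAX_TASKS];          /* finish times of already scheduled tasks  */
int ntasks;

/* Start time of task i: all predecessors must be done; unit_free_at is
 * the time at which the target Master or Slave becomes idle (pass 0 for
 * a task on the RCU, which hosts all ready hardware tasks in parallel). */
int start_time(int i, int unit_free_at)
{
    int start = unit_free_at;
    for (int j = 0; j < ntasks; j++)
        if (pred[i][j] && finish[j] > start)
            start = finish[j];
    return start;
}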
4.2.2. The DFG_Update Block
When a DFG is scheduled, the result modifies the DFG into a new structure. The DFG_Update block (Figure 7 in the appendix) generates new edges (dependencies between tasks) after scheduling, in order to impose a total execution order on each computing unit according to the scheduling results.
We represent the dependencies between tasks in the DFG by a matrix in which the rows represent the successors and the columns represent the predecessors. For example, Figure 8 depicts the dependency matrix corresponding to the DFG of Figure 2. After scheduling, the resulting matrix is an update of the original one and contains more dependencies. Producing this updated matrix is the role of the DFG_Update block.
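For illustration, the same convention can be written down directly in C; the readiness test below is our own reading of how such a matrix is used and is not the logic of the DFG_Update block itself.

#define N 8   /* total number of tasks (example size) */

int dep[N][N]; /* dep[i][j] = 1: task i (row, successor) depends on task j (column, predecessor) */
int done[N];   /* done[j] = 1 once task j has finished */

/* A task is ready for launch when every predecessor recorded in its row is done. */
int is_ready(int i)
{
    for (int j = 0; j < N; j++)
        if (dep[i][j] && !done[j])
            return 0;
    return 1;
}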
4.2.3. The MS_Manager Block
The objective of this module is to schedule the software tasks according to the algorithm given above. Figure 9 in the appendix presents the architecture of the Master Manager block. The input signal ASAP_SW represents the ASAP times of all the tasks. The Urgency_Time signal represents the urgency of each task of the application. The SW_Ready signal represents the Ready signals of all the software tasks.
The MIN_ASAP_TASKS signal represents all the tasks that are Ready and share the same minimum ASAP time.
The MAX_CT_TASKS signal represents all the tasks that are Ready and share the same maximum urgency. The tasks that satisfy the two preceding criteria are represented by the Tasks_Ready signal. The Task_Scheduled signal identifies the single software task that will be scheduled. With this signal, it is possible to select the right value of the TEXE_SW signal and then to update the SW_Total_Time signal. A single clock cycle is needed to schedule one software task.
By analogy, the Slave_Manager block plays the same role as the MS_Manager block: from the scheduling point of view there is no difference between the two processors.
4.3. HW/SW Scheduler Outputs
In this section, we describe how the results of our scheduler are processed by a target module such as an executive or a Real-Time Operating System (RTOS). As depicted in Figure 8, the output of our run-time HW/SW scheduler is an N x N matrix, where N is the total number of tasks in the DFG. Figure 10 shows the scheduling result of the DFG depicted in Figure 12. This matrix will be used by a centralized Operating System (OS) to fill its task queues for the three computing units.
The table shown in Figure 11 compiles the results of both the partitioning and the scheduling operations.
The OS browses the matrix row by row. Whenever it finds a "1", it puts the task whose number corresponds to the column into the Waiting state. At the end of a task's execution, the corresponding waiting tasks on each unit become either Ready or Running.
A task is in the Ready state only when all its dependencies are done but the target unit is busy. There is therefore no Ready state for the hardware tasks.
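As a rough illustration of this bookkeeping, the C sketch below promotes waiting tasks when one of their predecessors completes, using the row-as-successor convention of Figure 8; the names and the unit indexing are assumptions, and the actual OS logic may differ.

#define N 8   /* total number of tasks (example size) */

enum { WAITING, READY, RUNNING, STOPPED };

int sched[N][N];  /* scheduled matrix: sched[i][j] = 1 if task i depends on task j */
int state[N];     /* current state of each task                                    */
int done[N];      /* done[j] = 1 once task j has finished                          */
int target[N];    /* target unit of each task, from the partitioning               */
int unit_busy[3]; /* 0 = Master, 1 = Slave, 2 = RCU (assumed indexing)             */

/* Called at the end of task j: waiting tasks whose dependencies are now
 * all satisfied become Running if their unit is free, or Ready if it is
 * busy (hardware tasks never stay in the Ready state). */
void on_task_done(int j)
{
    done[j] = 1;
    state[j] = STOPPED;
    for (int i = 0; i < N; i++) {
        if (state[i] != WAITING)
            continue;
        int deps_ok = 1;
        for (int k = 0; k < N; k++)
            if (sched[i][k] && !done[k]) { deps_ok = 0; break; }
        if (deps_ok)
            state[i] = unit_busy[target[i]] ? READY : RUNNING;
    }
}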
It should be noted that if the OS runs on the Master processor, for example, the latter will be interrupted each time the OS has to execute.