3.1. Specifications of an FGDRA with OS Support
We have designed an FGDRA with OS support following the specifications detailed in this section.
It should first address the problem of the configuration speed of a task. This is one of our primary concerns because if the system spends more time configuring itself than actually running tasks, its efficiency will be poor. The configuration speed thus has a strong impact on the scheduling strategy.
In order to allow a wider choice of scheduling schemes, and to meet real-time requirements, our FGDRA platform must also include preemption facilities. For the same reasons as for configuration, the speed of the context saving and restoring process is another of our primary concerns. On this particular point, the previous work discussed in Section 2 will be adapted and reused.
Scheduling on a classical microprocessor is only a matter of time: the problem is to distribute the computation time among the different tasks. In the case of an FGDRA, the system must distribute both computation time and computation resources. Scheduling in such a system is then no longer a one-dimensional problem but a three-dimensional one: one dimension is time, and the other two represent the surface of reconfigurable resources. Performing efficient run-time scheduling that minimizes processing time is then a very hard problem, and the FGDRA should be designed to make it more tractable. The primary concern here is to ensure easy task relocation. For that, the reconfigurable logic core should be split into several equivalent blocks. This makes it possible to move a task from one block to any other block, or from a group of blocks to another group of the same size and form factor, without any change to the configuration data. The size of those blocks is a tradeoff between flexibility and scheduling efficiency.
Another aspect of an operating system is to provide intertask communication services. In our case we distinguish two situations. The first is a task running on top of our FGDRA and communicating with a task running on a different computing unit; this case will not be covered here, as it concerns the whole heterogeneous platform and not only the FGDRA computing units. The second is when two, or more, tasks running on top of the same FGDRA communicate together. This communication channel should remain the same wherever the tasks are placed on the FGDRA reconfigurable core and whatever the state of those tasks is (running, pending, waiting, ...). That means that the FGDRA platform must provide a rationalized communication medium, including exchange memories.
The same arguments also apply to inputs/outputs. Here again, two cases exist: first, I/O that is a global resource of the whole platform; second, special I/O directly bound to the FGDRA.
3.2. Proposed Solutions
Figure 1 shows a global view of OLLAF, our original FGDRA designed to efficiently support OS services such as preemption and configuration transfers.
In the center stands the reconfigurable logic core of the FGDRA. This core is a dual plane, an active plane and a hidden one, organized in columns; each column can be reconfigured separately and offers the same set of services. A task is mapped on an integer number of columns. This topology has been chosen for two reasons. First, partial reconfiguration by column reduces the scheduling problem to two dimensions (time plus a one-dimensional surface), which is easier to handle when minimizing processing time. Second, as every column is identical and offers the same set of services, tasks can be moved from one column to another without any change to the configuration data.
In the figure, at the bottom of each column, you can notice two hardware blocks called the CMU and the HCM. The CMU is an IP core that automatically manages the saving and restoring of task contexts. The HCM, standing for Hardware Configuration Manager, plays the same role for the configuration data, also called the bitstream. More details about this controller can be found in [1]. Each column also includes a local cache memory named the LCM, a first level of cache that stores contexts and configurations close to the column where they will most probably be required. The internal architecture of the core provides adequate material to work with the CMU and HCM; this is discussed in the next section.
On the right of the figure stands a big block called "HW Sup + HW RTK + CCR". This block contains a hardware supervisor running a custom real-time kernel specially adapted to handle FGDRA-related OS services and platform-level communication services. In the first prototype presented here, this hardware supervisor is a classical 32-bit microprocessor. Along with the supervisor, a central memory is provided for OS use only; it basically stores the configurations and contexts of every task that may run on the FGDRA. The supervisor communicates with all columns through a dedicated control bus and can initiate context transfers, from and to the hidden plane, by writing into the CMUs' and HCMs' registers through this bus.
Finally, at the top of Figure 1, you can see the application communication medium, which provides a communication port to each column. These ports are directly bound to the reconfigurable interconnection matrix of the core. If I/O had to be bound to the FGDRA, it would be connected to this communication medium in the same way the reconfigurable columns are.
This architecture has been developed as a VHDL model in which the size and number of columns are generic parameters.
3.3. Logic Core Overview
OLLAF's logic core is functionally the same as the logic fabric found in any common FPGA. Each column is an array of Logic Elements (LEs) surrounded by a programmable interconnect network. The basic functional architecture of an LE is shown in Figure 2: it is composed of a LUT and a D flip-flop, to which several multiplexers and/or programmable inverters can be added.
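To make this LE model concrete, here is a minimal software sketch in C of the LUT-plus-flip-flop structure. The names, the 4-input LUT width, and the output multiplexer are illustrative assumptions on our part, not part of the OLLAF specification.

```c
#include <stdint.h>

/* Software model of a 4-input Logic Element: a 16-bit LUT feeding
 * a D flip-flop (names and sizes are illustrative). */
typedef struct {
    uint16_t lut;        /* configuration: one output bit per input combination */
    uint8_t  ff;         /* context: current flip-flop state */
    uint8_t  registered; /* configuration: 1 = registered output, 0 = combinational */
} logic_element;

/* Combinational LUT output for a 4-bit input vector. */
static uint8_t le_comb(const logic_element *le, uint8_t inputs)
{
    return (le->lut >> (inputs & 0xF)) & 1u;
}

/* One simulated clock edge: the flip-flop captures the LUT output. */
static void le_clock(logic_element *le, uint8_t inputs)
{
    le->ff = le_comb(le, inputs);
}

/* Output selection mirrors a multiplexer after the flip-flop. */
static uint8_t le_output(const logic_element *le, uint8_t inputs)
{
    return le->registered ? le->ff : le_comb(le, inputs);
}
```

Note that the `lut` and `registered` fields belong to the configuration data, while `ff` belongs to the context; this separation is exactly the one exploited in the following paragraphs.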
All the material added to the reconfigurable logic core to support the OS concerns the configuration memories. That means that, from a user's point of view, designing for OLLAF is similar to designing for any common FPGA. It also means that the results presented here would still hold if the functionality of the LEs were improved.
Configuration data and context data (the flip-flop contents) follow two separate paths. A context swap can therefore be performed without any change of configuration, which is interesting for checkpointing or when running more than one instance of the same task.
3.4. Configuration, Preemption, and OS Interaction
The previous sections presented an architectural view of our FGDRA. In this section, we discuss the impact of this architecture on OS services. We consider here the three services most specifically related to the FGDRA:
(i) First, the configuration management service: on the hardware side, each column provides an HCM and an LCM, which means that configurations have to be prefetched into the LCMs. The associated service running on the hardware supervisor must take that into account: it has to manage an intelligent cache that prefetches task configurations into the columns where the tasks will most probably be mapped (a sketch of such a cache is given in Section 3.5).
(ii) Second, the preemption service: the same principles apply here as for configuration management, except that contexts must also be saved. The context management service must ensure that at most one valid context exists for each task in the entire FGDRA. Contexts must thus be transferred as soon as possible from the LCMs to the centralized global memory of the hardware supervisor. This service also has a strong impact on the scheduling service, as the ability to perform preemption with a very low overhead allows the use of more flexible scheduling algorithms.
(iii) Finally, the scheduling service, and in particular its space management part: it takes advantage of the column topology and of the centralized communication scheme. The reconfigurable resource can then be managed as a virtually infinite space containing an unbounded number of columns, and the job is to dynamically map this virtual space onto the real space (the actual reconfigurable logic core of the FGDRA), as sketched below.
3.5. Context Management Scheme
In [1], we proposed a context management scheme based on a scanpath, a local context memory, and the CMU. The context management scheme in OLLAF differs in two ways. First, all the context management material is hardwired. Second, we added two more stages in order to further lower the preemption overhead and to ensure the consistency of the system.
As the context management material is now added at the hardware level and no longer at the task level, it has to be split differently. Since the programmable logic core is column based, it was natural to implement context management at the column level: a CMU and an LCM have been added to each column, and one scanpath is provided for each column's set of flip-flops.
In order to lower the preemption overhead, our reconfigurable logic core uses a dual plane, an active plane and a hidden plane. The flip-flops used in the logic elements are thus replaced with two flip-flops plus switching material; the architecture of this dual-plane flip-flop can be seen in Figure 3. Run and scan are then no longer two working modes but two parallel planes whose roles can be swapped. With this topology, the context of a task can be shifted in while the previous task is still running, and shifted out while the next one is already running. The effective task-switching overhead is then brought down to one clock cycle, as illustrated in Figure 5.
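The following C sketch is a purely software model of this mechanism, with illustrative names and sizes of our own choosing: the hidden plane is loaded and unloaded one bit at a time through the scanpath while the active plane keeps running, and the swap itself is a single operation (one clock cycle in hardware).

```c
#include <stdint.h>
#include <string.h>

#define COLUMN_FFS 256  /* flip-flops per column (illustrative size) */

/* Software model of one column's dual-plane flip-flop storage. */
typedef struct {
    uint8_t plane[2][COLUMN_FFS];
    int     active;             /* index of the currently active plane */
} column_ctx;

/* One scanpath step on the hidden plane: shifts a new context bit in
 * and returns the bit pushed out (headed for the LCM). The active
 * plane is untouched, so the running task is not disturbed. */
static uint8_t scan_step(column_ctx *col, uint8_t bit_in)
{
    uint8_t *hidden  = col->plane[1 - col->active];
    uint8_t  bit_out = hidden[COLUMN_FFS - 1];
    memmove(hidden + 1, hidden, COLUMN_FFS - 1);
    hidden[0] = bit_in;
    return bit_out;
}

/* The plane swap itself: in hardware, a one-clock-cycle operation. */
static void plane_swap(column_ctx *col)
{
    col->active = 1 - col->active;
}
```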
Contexts are transferred by the CMU between the LCM and the hidden plane through the scanpath. Because the contexts of all columns can be transferred in parallel, the LCM is placed at the column level, which is particularly useful when a task uses more than one column. In the first prototype, each LCM can store 3 configurations and 3 contexts. The LCM optimizes access to a bigger memory called the Central Context Repository (CCR).
The CCR is a large memory space storing the context of each task instance run by the system. Each LCM should then store the contexts of the tasks that are most likely to be the next to run on the corresponding column.
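The LCM thus behaves as a small cache in front of the CCR. The C sketch below shows one possible management of its slots; the slot count matches the first prototype, but the LRU eviction policy and all names are our assumptions.

```c
#define LCM_SLOTS 3   /* first prototype: 3 contexts and 3 configurations */
#define FREE_SLOT (-1)

typedef struct {
    int      task[LCM_SLOTS];     /* task whose data occupies each slot */
    unsigned last_use[LCM_SLOTS]; /* for the (assumed) LRU policy */
} lcm_cache;

/* Ensure task 'tid' is present in this column's LCM; on a miss, the
 * least recently used slot is reused (the CCR transfer itself is not
 * modeled here). The scheduler should call this early enough that the
 * data is already local when the task is mapped on the column. */
static int lcm_prefetch(lcm_cache *lcm, int tid, unsigned now)
{
    int victim = 0;
    for (int s = 0; s < LCM_SLOTS; s++) {
        if (lcm->task[s] == tid) {          /* hit: nothing to transfer */
            lcm->last_use[s] = now;
            return s;
        }
        if (lcm->last_use[s] < lcm->last_use[victim])
            victim = s;
    }
    lcm->task[victim] = tid;                /* miss: fetch from the CCR */
    lcm->last_use[victim] = now;
    return victim;
}
```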
After a preemption of the corresponding task, a context can be stored in more than one LCM in addition to the copy stored in the CCR. In such a situation, care must be taken to ensure the consistency of the task's execution. For that purpose, each time a context saving is performed, the CMU tags the context with a version number. The operating system keeps track of this version number and increments it at each context saving. The system can then check the validity of a context before restoring it. The system must also update the context copy in the CCR as soon as possible after each context saving, following a write-through policy.
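Here is a minimal C sketch of this versioning; the bookkeeping structures are our assumptions, and only the tag-and-compare principle comes from the scheme above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TASKS 32   /* illustrative bound */

/* Every saved copy (in an LCM or in the CCR) carries the version tag
 * attached at save time; the OS keeps the latest tag for each task. */
typedef struct {
    int      task_id;
    uint32_t version;  /* tag attached when this copy was saved */
    /* ... context payload ... */
} saved_context;

static uint32_t latest_version[MAX_TASKS];

/* Model of a context saving: the OS increments its counter and the new
 * copy is tagged with it (in OLLAF the CMU writes the tag in hardware). */
static void on_context_saved(saved_context *copy)
{
    copy->version = ++latest_version[copy->task_id];
    /* write-through: this copy should reach the CCR as soon as possible */
}

/* Before a restoration, check that the copy found (possibly a stale
 * one left in some LCM) is indeed the latest version. */
static bool context_is_valid(const saved_context *copy)
{
    return copy->version == latest_version[copy->task_id];
}
```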
The dual plane, the LCMs, and the CCR form a memory hierarchy specially designed to minimize the preemption overhead, as shown in Figure 4. The same memory scheme is also used for configuration management, except that a configuration does not change during execution: it does not need to be saved, so no version control is required. The programmable logic core uses a dual configuration plane equivalent to the dual context plane. Each column has an HCM, which is a simplified version of the CMU (without the saving mechanism). The LCM is designed to store an integer number of both contexts and configurations.
In the best case, the preemption overhead can thus be brought down to one clock cycle.
A typical preemption scenario is presented in Figure 5. In this scenario, we consider the case where the contexts and configurations of both tasks are already stored in the LCM. Let us consider that a task T1 is preempted in order to run another task T2; the scenario is then as follows (a supervisor-side sketch is given after the list):
(i) T1 is running and the scheduler decides to preempt it to run T2 instead,
(ii) T2's configuration and, if any, its context are shifted onto the hidden plane,
(iii) once the transfer is completed, the two planes are switched,
(iv) T2 is now running and T1's context can be shifted out to be saved,
(v) T1's context is updated in the CCR as soon as possible.
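To make this sequence concrete, here is a hypothetical supervisor-side sketch in C. The register-level helpers (hcm_load_configuration, cmu_load_context, and so on) are invented names standing for writes to the CMU/HCM registers over the control bus; the actual register interface is not defined here.

```c
/* Hypothetical control-bus primitives: each stands for one or more
 * writes to the registers of a column's HCM or CMU. Names and
 * signatures are ours, for illustration only. */
void hcm_load_configuration(int column, int task); /* LCM -> hidden plane */
void cmu_load_context(int column, int task);       /* LCM -> hidden plane */
void wait_transfer_done(int column);
void swap_planes(int column);                      /* one clock cycle */
void cmu_save_context(int column, int task);       /* hidden plane -> LCM */
void ccr_write_through(int task);                  /* LCM -> CCR */

/* Supervisor-side view of the preemption scenario of Figure 5, for a
 * single column; the steps are numbered as in the list above. */
void preempt(int column, int t1, int t2)
{
    /* (ii) load T2 onto the hidden plane while T1 keeps running */
    hcm_load_configuration(column, t2);
    cmu_load_context(column, t2);
    wait_transfer_done(column);

    /* (iii) swap the planes: T2 starts, T1 freezes */
    swap_planes(column);

    /* (iv) T2 is running; shift T1's context out to the LCM */
    cmu_save_context(column, t1);

    /* (v) update T1's context in the CCR as soon as possible */
    ccr_write_through(t1);
}
```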