As reported in Section 2, various implementations of multicore platforms based on soft-core processors exist. This work, instead, does not focus only on a specific implementation: it also defines a design flow that addresses the problem of implementing a multicore platform on FPGA that supports the OpenMP library and can be analyzed by means of a distributed HW profiling system. In particular, the main goal is to support a designer who needs to improve the performance of an embedded application. Starting from functional and non-functional requirements (in this case, the required execution speed-up), the main steps of such a flow are shown in Fig. 1.
The entry point is a target application (e.g., a program written in C/C++) running on a single core on FPGA and already able to satisfy the functional requirements. The first step evaluates whether OpenMP parallelization (over a variable number of cores) can satisfy the (non-functional) speed-up requirement. Strictly related to this analysis is the identification of the architectural parameters (e.g., cache organization, bus bandwidth) that could affect the same requirement. Once such parameters are identified, the last action is to evaluate their optimal values. All of this can be performed by means of a proper system-level simulator. In other words, this step performs a design space exploration with respect to several ways of exploiting OpenMP features and different architectural parameters. The second step is the actual implementation of the identified multicore architecture on FPGA: starting from the results of the first step (i.e., the multicore architecture and its parameters), all the elements are instantiated and connected on FPGA. The third step is the selection and integration of a monitoring solution able to measure the parameters needed to evaluate, at run-time, the actual execution speed-up on the target system. The fourth step is the integration of an operating system and of the components needed to support OpenMP-based applications. The last step is requirement validation on the final target. In the next subsections, each step of the design flow is explained in more detail.
3.1 Modeling and simulation
The first step is the modeling and simulation one. It consists of modeling the target architecture in order to estimate, by means of simulation, the system performance in the execution of the target application. For this purpose, the HW/SW elements can be modeled at different abstraction levels by using block diagrams, UML, SystemC, or other modeling languages. Normally, the accuracy of the results depends on the abstraction level, which, unfortunately, also affects simulation time. Moreover, simulation time depends on the simulator itself and on the features of the machine (the host) used to perform the simulations. In the context of this work, the VIrtual Parallel platform for Performance Analysis (VIPPE) simulator [22] has been selected. It is an electronic design automation tool for HW/SW simulation that provides an extensible library of multicore platforms and allows the simulation of an operating system and of applications that use OpenMP. It provides run-time simulation statistics such as execution time, cache behavior, and power dissipation. VIPPE relies on platform modeling based on UML/MARTE. By means of this modeling language, it is possible to model the platform and the mapping between tasks and processing cores. Moreover, it is also possible to model an operating system and specific libraries (such as OpenMP). From an operative point of view, after the modeling phase is completed, the target application is compiled by means of LLVM [23] (or a different source compiler, depending on what is supported by the target processor) in order to obtain the related assembly instructions. Then, for each assembly instruction, VIPPE considers a cost in terms of execution time and energy consumption. Such costs depend on another model file that describes the processor under simulation: it contains the list of instructions and the associated costs.
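The per-instruction costing described above can be illustrated with a minimal sketch. It is not VIPPE code: the processor model, its mnemonics, the cost values, and the function name estimate_cost are all hypothetical, and the sketch ignores caches, pipelining, and operating system effects.

```python
from collections import Counter

# Hypothetical processor model: mnemonic -> (cycles, energy in nJ).
# The values are illustrative only and are not taken from any VIPPE model file.
PROCESSOR_MODEL = {
    "add": (1, 0.5),
    "mul": (3, 1.5),
    "ld": (2, 1.0),
    "st": (2, 1.0),
}


def estimate_cost(instructions, model, clock_hz):
    """Sum the per-instruction costs over a list of executed instructions.

    Returns (total cycles, execution time in seconds, energy in nJ).
    """
    counts = Counter(instructions)
    cycles = sum(model[op][0] * n for op, n in counts.items())
    energy_nj = sum(model[op][1] * n for op, n in counts.items())
    return cycles, cycles / clock_hz, energy_nj


# A short, made-up instruction trace executed on an assumed 50 MHz core.
trace = ["ld", "ld", "mul", "add", "st"]
cycles, seconds, energy_nj = estimate_cost(trace, PROCESSOR_MODEL, clock_hz=50_000_000)
```

Keeping the costs in a separate table mirrors the separation, described above, between the application and the file that describes the processor under simulation.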
The design space exploration that can be performed by using the simulator is illustrated in Fig. 2. Starting from the speed-up requirement for the target application, it is possible to identify the number of cores, the cache parameters, and the OpenMP clauses needed to satisfy the requirement. With regard to simulation time, it is worth noting that VIPPE allows the (hopefully multicore) host machine to be fully exploited by means of host-based simulation [24]. In this type of simulation, each target thread is mapped onto a host thread. VIPPE adds another thread, called kernel, that communicates with the other threads, managing the simulation and collecting execution performance data [22]. In conclusion, VIPPE can be used to analyze the correctness of the target application when executed on a multicore platform, to perform a design space exploration with respect to architectural parameters, and to evaluate the impact of different choices in the use of OpenMP.
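The exploration loop can be sketched as follows. This is only an illustration under stated assumptions: in the proposed flow each estimate would come from a simulation run, whereas here an analytic stand-in (Amdahl's law scaled by a crude cache penalty) replaces the simulator. The function names, the parallel fraction, and the cache options are hypothetical.

```python
from itertools import product


def estimated_speedup(cores, parallel_fraction, miss_penalty):
    """Stand-in estimator: Amdahl's law scaled by a crude cache penalty.

    This replaces the simulation run that would produce the estimate
    in the real flow; it is an assumption of this sketch.
    """
    ideal = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
    return ideal / (1.0 + miss_penalty)


def explore(required_speedup, core_options, cache_options, parallel_fraction):
    """Return the configurations that meet the speed-up requirement.

    Each result is (cores, cache size in KB, estimated speed-up),
    sorted so that the smallest configurations come first.
    """
    feasible = []
    for cores, (cache_kb, miss_penalty) in product(core_options, cache_options):
        s = estimated_speedup(cores, parallel_fraction, miss_penalty)
        if s >= required_speedup:
            feasible.append((cores, cache_kb, round(s, 3)))
    return sorted(feasible)


# Hypothetical cache options: (size in KB, relative slowdown due to misses).
CACHE_OPTIONS = [(4, 0.20), (8, 0.10), (16, 0.05)]
configs = explore(2.0, [1, 2, 4], CACHE_OPTIONS, parallel_fraction=0.9)
```

With these made-up numbers only the four-core configurations reach a speed-up of 2, so the first entry of configs is the smallest feasible design point.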
3.2 FPGA implementation
After the modeling and simulation step, the platform can be implemented on FPGA using the identified parameters of the multicore architecture. The actual process and the toolchain to be used strictly depend on the target technologies.
3.3 Monitoring system
Starting from the multicore platform developed in the previous step, the response time on the real target has to be measured in order to evaluate the system speed-up. For this purpose, a run-time monitoring system should be integrated in the final system. As described in Section 2, several options are available. In order to avoid, as much as possible, introducing overhead in the software execution, the proposed flow is based on a hardware solution. Specifically, it exploits a fully customizable and portable distributed HW solution based on AIPHS [25, 26], a library of elements that can be used to build a monitoring solution. It supports architectures based on different soft-processors while changing only a few hardware components. Such a choice overcomes some limitations of existing approaches. For example, the solution proposed in [21] lacks portability among different soft-processors, while ABACUS, a profiling solution adapted to the multicore scenario [20], although it is a smart profiling solution portable among different architectures, has a high area occupation because it is intended to be used during development phases. In this work, the monitoring system added to the multicore architecture is intended to be left in the final platform, so the hardware overhead has to be taken into account, as will be shown in the next sections.
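Once response times are available, the speed-up check reduces to a ratio. The following sketch assumes that the monitoring system exposes cycle counts for the single-core and multicore runs; the function names and the counter values are hypothetical and do not come from AIPHS.

```python
def speedup(single_core_cycles, multicore_cycles):
    """Speed-up as the ratio between single-core and multicore response times."""
    if single_core_cycles <= 0 or multicore_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return single_core_cycles / multicore_cycles


def meets_requirement(single_core_cycles, multicore_cycles, required_speedup):
    """Check the measured speed-up against the non-functional requirement."""
    return speedup(single_core_cycles, multicore_cycles) >= required_speedup


# Made-up counter values read at the end of the two runs.
measured = speedup(1_200_000, 400_000)
```

Both runs are assumed to use the same clock frequency, so the ratio of cycle counts equals the ratio of response times.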
The customization of a monitoring solution for the proposed LEON3-based implementation is shown in the next section.
3.4 OS and OpenMP support
Once the HW multicore architecture has been implemented, by customizing and interconnecting soft-processors as suggested by the simulation results, and the hardware monitoring mechanism has been inserted, an SMP Linux operating system has to be customized. Such an OS is needed to support OpenMP applications. OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in FORTRAN and C/C++ programs. It instructs the compiler to organize parallel sections of code in a specific manner and helps to parallelize the execution of an application. It is based on a fork-join model. The GCC implementation of OpenMP, called libgomp [27], has been selected to support the execution of parallelized applications that use this library. This motivates the need for an SMP Linux distribution. In order to provide libgomp on the target, the required SW components have to be ported by cross-compiling the source files and inserting the results in the kernel.
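The fork-join model can be illustrated with a small sketch. OpenMP itself targets C/C++ and FORTRAN; Python threads are used here only as a stand-in to show the structure, and the function parallel_sum is hypothetical. The pattern is roughly analogous to an OpenMP parallel loop with a sum reduction.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, num_threads):
    """Fork-join sum: split the data, sum the chunks in parallel, combine."""
    # Fork: one chunk of (nearly) equal size per worker thread.
    chunk = max(1, (len(values) + num_threads - 1) // num_threads)
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Join: collecting the results waits for every worker to finish.
        partial_sums = list(pool.map(sum, chunks))
    # Sequential part resumes here, after the join.
    return sum(partial_sums)


total = parallel_sum(list(range(1, 101)), num_threads=4)
```

The sketch shows only the control structure; it says nothing about the speed-up achievable on the target, which depends on the platform and on the OpenMP runtime.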