Flexibility and target architecture are two major criteria for any implementation. First, we decided to build our implementation using a high-level description model/language. Modelling at a high level of description leads to quicker simulation, better bandwidth estimation, and better functional validation; above all, it helps delay the choice of system orientation and, thereafter, of the hardware target.
5.1. SystemC Description
C++ implements Object-Orientation on the C language. Many hardware engineers may consider the principles of Object-Orientation fairly remote from the creation of hardware components. Nevertheless, Object-Orientation grew out of design techniques used in hardware design. Data abstraction, the central aspect of Object-Orientation, can be found in everyday hardware designs in the use of publicly visible "ports" and private "internal signals". Moreover, component instantiation in hardware designs is almost identical to the principle of "composition" used in C++ to create hierarchical designs. Hardware components can thus be modelled in C++ and, to some extent, the mechanisms used are similar to those used in HDLs. Additionally, C++ provides inheritance as a complement to the composition mechanism, which promotes design reuse.
Nonetheless, C++ does not support concurrency, which is an essential aspect of system modelling. Furthermore, timing and propagation delays cannot easily be expressed in C++.
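The ports/signals and instantiation/composition analogy above can be illustrated with a minimal plain-C++ sketch (not taken from the paper; the gate and adder components are purely illustrative): public data members play the role of ports, a private member holds an instantiated sub-component, and a method evaluates the combinational behaviour.

```cpp
#include <cassert>

// Illustrative "component" with public input/output ports,
// analogous to an HDL entity's port list.
struct AndGate {
    bool a = false, b = false;   // input "ports"
    bool y = false;              // output "port"
    void evaluate() { y = a && b; }
};

// A hierarchical component: composition in C++ mirrors
// sub-component instantiation in a hardware design.
struct HalfAdder {
    bool a = false, b = false;       // input ports
    bool sum = false, carry = false; // output ports
    void evaluate() {
        carryGate.a = a;             // internal "wiring"
        carryGate.b = b;
        carryGate.evaluate();
        sum   = a != b;              // XOR
        carry = carryGate.y;
    }
private:
    AndGate carryGate;               // private sub-component instance
};
```

As in an HDL, a user of `HalfAdder` sees only its ports; the internal `AndGate` instance and its signals stay hidden, which is exactly the data abstraction the text refers to.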
SystemC [18] is a relatively new modeling language based on C++ for system-level design. It has been developed as a standardized modeling language for systems containing both hardware and software components.
The SystemC class library provides the constructs necessary to model system architecture, including reactive behaviour, scheduling policies, and hardware-like timing, none of which are available in standalone C/C++.
Using SystemC offers multiple advantages over classic hardware description languages such as VHDL and Verilog: flexibility, simplicity, simulation speed, and, above all, portability, to name a few.
5.2. SystemC Implementation for Functional Validation and Verification
The SystemC approach consists of a progressive refinement of specifications. Therefore, an initial implementation was done using an abstract, high-level, timed functional representation.
In this implementation, we used the proposed parallel structure discussed in Section 4.
This model consists of high-level SystemC modules (TLM) communicating with each other through channels, signals, or even memory-block modules written in SystemC (Figure 10). Scheduling and timing were used but were not yet exploited for hardware-like purposes. The data types used in this model are strictly C++ data types.
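The module-to-module communication style can be sketched in plain C++ (this is an analogy, not SystemC code; the `Channel`, `Producer`, and `Consumer` names are invented for illustration): two "modules" exchange tokens through a FIFO channel object, much as the TLM modules in the design exchange data through SystemC channels.

```cpp
#include <queue>
#include <cassert>

// A FIFO "channel" connecting two modules, in the spirit of a
// SystemC channel (illustrative only).
struct Channel {
    std::queue<int> fifo;
    void write(int v) { fifo.push(v); }
    bool read(int &v) {
        if (fifo.empty()) return false;
        v = fifo.front();
        fifo.pop();
        return true;
    }
};

// Producer module: emits one token per invocation ("cycle").
struct Producer {
    Channel &out;
    void step(int value) { out.write(value); }
};

// Consumer module: consumes a token if one is available.
struct Consumer {
    Channel &in;
    int last = -1;
    void step() { int v; if (in.read(v)) last = v; }
};
```

The key point is that modules never touch each other's internals; all interaction goes through the channel interface, which is what makes later refinement of the communication possible.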
As for the cascade/classifiers, we chose to use the database found in the Open Computer Vision Library [17] (OpenCV). OpenCV provides the most widely used trained cascade/classifier datasets and face-detection software (Haar-Detector) today for the standard prototype of the Viola-Jones algorithm. The particular classifiers used in this library are those trained for a base detection window of
pixels, using AdaBoost. These classifiers were created and trained by Lienhart et al. [19] for the detection of upright frontal faces. Their detection rate is between 80% and 92%, depending on the image database.
The output of our implementation is the set of addresses of the sub-windows that contain, according to the detector, an object of the particular type (a face, in our case). Functional validation is done by simulation (Figure 11). Multiple tests were then carried out, including visual comparisons on a dataset of images, inspection of simulation signals, and tests comparing the response of each classifier with its counterpart implemented in OpenCV's Haar-Detector software. All of these tests indicate that we achieve the same detection rate as the software provided by OpenCV. The images used in these tests were taken from the CMU+MIT face databases [20].
The choice of working with faces, instead of other object types, facilitates comparison with other recent works. However, using this structure for detecting other object types is entirely feasible, on the condition of having a trained classifier dataset for the specific object. This can be considered a simple task, since OpenCV also provides the training software for the cascade detector. Moreover, classifiers from other boosting variants can be implemented easily, since the structure is written in a high-level language. As a result, changing the boosting variant is a minor modification, since the architecture of the cascade detector stays intact.
5.3. Modelling for Embedded Implementation
While the previous SystemC model is very useful for functional validation, further optimization must be carried out to reach a hardware implementation. Indeed, the SystemC standard is a system-level modelling environment that allows designs at various abstraction levels. The design cycle starts with an abstract, high-level, untimed or timed functional representation that is refined to a bus-cycle-accurate model and then to an RTL (Register Transfer Level) hardware model. SystemC provides several data types in addition to those of C++; these data types are mostly adapted to hardware specification.
Moreover, a SystemC hardware model can be synthesized for various target technologies. Numerous behavioural synthesis tools are available on the market for SystemC (e.g., Synopsys Cocentric compiler, Mentor Catapult, SystemCrafter, and AutoESL). It should be noted that, for all these tools, the initial simulatable SystemC description must be refined before it can be synthesized into hardware. The reason is that the SystemC language is a superset of C++ designed for simulation.
Therefore, a new, improved and, above all, more refined "cycle-accurate RTL model" version of the design was created.
Our design is split into compilation units, each of which can be compiled separately. Alternatively, it is possible to use several tools for different parts of the design, or to use the partition to exploit as much of the available parallelism and pipelining as possible for a more efficient hardware implementation. Eventually, the main block modules of the design were split into groups of small modules that work in parallel and/or in pipeline. For instance, the module BLOCK1 contains three compilation units (modules): a "Decision" module, which contains the first stage's classifiers and performs the computation and decision on each sub-window; a "Shift-and-Scale" module, used for shifting and scaling the window in order to obtain all subsequent locations; and a "Memory-Ctrl" module, which manages the intermediate memory access.
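The role of the "Shift-and-Scale" module can be sketched as a sub-window enumeration loop (an illustrative C++ sketch, not the paper's RTL; the step rule, base size, and scale factor used in the test are assumptions, not the paper's exact parameters): starting from the base window, the window is shifted across the image and then enlarged by a scale factor, pass after pass.

```cpp
#include <vector>
#include <cassert>

// One candidate sub-window: top-left corner and side length.
struct SubWindow { int x, y, w; };

// Enumerate every sub-window position and scale for an image,
// growing the square window by scaleFactor on each pass.
std::vector<SubWindow> enumerateWindows(int imgW, int imgH,
                                        int baseSize, double scaleFactor) {
    std::vector<SubWindow> out;
    for (double s = baseSize; s <= imgW && s <= imgH; s *= scaleFactor) {
        int w = static_cast<int>(s);
        int step = (w / 8 > 0) ? w / 8 : 1;   // shift grows with scale
        for (int y = 0; y + w <= imgH; y += step)
            for (int x = 0; x + w <= imgW; x += step)
                out.push_back({x, y, w});
    }
    return out;
}
```

In the hardware partition, this enumeration runs concurrently with the "Decision" module, feeding it one sub-window location per iteration.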
As a result, a SystemC model composed of 11 modules was obtained (Figure 12): three for BLOCK1, two for BLOCK2, two for BLOCK3, one for the Integral Image transformation, two for the SRAM simulation, and one for the SDRAM intermediate memory (discussed later in this chapter).
Other major refinements were carried out: divisions were simplified into power-of-two divisions, the dataflow model was further refined into a SystemC/C++ combination of finite state machines and datapaths, loops were exploited, and timing and scheduling were taken into consideration. Note that, in most cases, parallelism and pipelining were forced manually. On the other hand, not all the modules were heavily refined; for example, the two SRAM modules are only used to simulate a physical memory and will never be synthesized, whatever the target platform.
5.4. Intermediate Memory
One of the drawbacks of the proposed parallel structure (given in Section 4) is the use of additional intermediate memories (unnecessary in the software implementation). Logically, an inter-block memory unit is formed of two memories working in ping-pong.
A stored address should hold the position of a particular sub-window and its scale; there is no need for two-dimensional positioning, since the Integral Image is stored as a one-dimensional table for better RAM storage.
For a
image and an initial mask's size of
pixels, a word of 20 bits is enough to store the concatenation of the position and the scale of each sub-window.
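The concatenation of position and scale into one word can be sketched as a simple bit-packing routine (illustrative only; the paper states that 20 bits suffice in total, but the 15-bit/5-bit split below is our assumption for the example, not a figure from the text):

```cpp
#include <cstdint>
#include <cassert>

// Assumed split of the 20-bit word: upper bits hold the
// one-dimensional position, lower SCALE_BITS hold the scale index.
constexpr int SCALE_BITS = 5;
constexpr uint32_t SCALE_MASK = (1u << SCALE_BITS) - 1;

// Concatenate a sub-window's position and scale into one word.
uint32_t packAddress(uint32_t position, uint32_t scaleIndex) {
    return (position << SCALE_BITS) | (scaleIndex & SCALE_MASK);
}

// Recover the position and scale from a stored word.
void unpackAddress(uint32_t word, uint32_t &position, uint32_t &scaleIndex) {
    scaleIndex = word & SCALE_MASK;
    position   = word >> SCALE_BITS;
}
```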
As for the capacity of the memories, the worst-case scenario occurs when half of the possible sub-windows pass the first block. That leads to around 50,000 addresses (50% of the sub-windows) to store. Applying the same logic to the next block, the total number of addresses to store should not exceed 75,000. With 20-bit words, 75,000 addresses amount to roughly 187.5 Kbytes, so a combined memory capacity of less than 192 Kbytes is needed.
Moreover, the simulation of our SystemC model shows that, even in the case of consecutive positive decisions for a series of sub-windows, access to those memories does not occur more than once every 28 cycles (for mem.1 and mem.2) or once every 64 cycles (for mem.3 and mem.4).
Given these facts, we propose a time-sharing system (shown in Figure 13) using four memory banks, working as a FIFO block, with only one physical memory. A typical hardware implementation of a 192-Kbyte SDRAM or DDRAM memory, running at a frequency of at least 4 times that of the FIFO banks, suffices to replace the four logical memories.
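The time-sharing idea can be sketched as a round-robin arbiter serving four logical FIFO banks from one shared resource (an illustrative C++ model, not the paper's RTL; names are invented): because the physical memory runs at at least four times the bank frequency, one bank can be serviced per fast cycle without any bank ever stalling.

```cpp
#include <queue>
#include <cassert>

// One physical memory time-shared among four logical FIFO banks.
struct SharedMemory {
    std::queue<int> bank[4];  // the four logical FIFO banks
    int next = 0;             // round-robin pointer

    void push(int bankId, int v) { bank[bankId].push(v); }

    // One fast-clock cycle: service the next non-empty bank,
    // if any, and advance the round-robin pointer past it.
    bool serviceOne(int &bankId, int &v) {
        for (int i = 0; i < 4; ++i) {
            int b = (next + i) % 4;
            if (!bank[b].empty()) {
                v = bank[b].front();
                bank[b].pop();
                bankId = b;
                next = (b + 1) % 4;
                return true;
            }
        }
        return false;  // all banks empty this cycle
    }
};
```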
SystemC simulation shows that 4 Kbits is enough for each memory bank. The FIFOs were easily added using SystemC's own predefined sc_fifo channel.