In this section, the proposed method is presented and applied to evaluating software applications running on the NXP LPC2106, an ARM7TDMI-S-based architecture [19].
4.1. Architecture Modeling
The LPC2106 has 128 kB of on-chip FLASH and 64 kB of on-chip SRAM. Its ARM7TDMI-S processor enables system designers to build embedded devices requiring small size, low power, and high performance. This processor is a 32-bit RISC architecture that consists of a program control unit, an address generator, an integer data path, a general-purpose register bank, and a 3-stage pipeline. An important characteristic of the LPC2106 is an instruction prefetch module, known as the Memory Accelerator Module (MAM). The MAM is connected to the local bus and is placed between the FLASH memory and the ARM7TDMI-S core. Like a cache, the MAM attempts to prefetch the next instruction from the FLASH memory in time to prevent CPU fetch stalls.
In order to model the LPC2106 architecture, a library of generic blocks of CPN models has been constructed. These blocks can be combined in a bottom-up manner to model sophisticated behaviors. Modeling a complex architecture thus becomes a relatively simple process. The proposed CPN models are high-level representations that focus on what the architecture should perform instead of on how it is implemented. Moreover, it is important to stress that once constructed, a building block can be reused in other platform models.
Figure 2 presents the highest-level view of the model, which is composed of the following building blocks (pages, see Section 3.2): flash memory, ram memory, fetch/decode, and execute. The fetch/decode and execute blocks model, respectively, the first two and the last stages of the LPC2106's pipeline. Between these two blocks, there is a place (control), which controls the flow of instructions through the pipeline (see Figure 1). The marking of place control represents the set of available functional units. The ram memory block models the SRAM memory, and the flash memory block models the FLASH memory. In these models, timing information is expressed in cycles and is represented through transition firing delays. Energy consumption is expressed in nJ and is modeled through the addEnergy function, which adds the specified energy consumption to the global simulated consumption. Information regarding time and energy consumption was assessed through measurements using the AMALGHMA platform (see Section 4.4) as well as from the LPC2106 datasheet [20] and the ARM7TDMI-S reference manual [19].
Except for two differences, the fetch/decode block is identical to the model presented in Figure 1. Since, in the LPC2106 architecture, the FLASH memory stores the application code, the first difference is that the fetch stage is now connected to the flash memory block. Figure 3(a) shows this connection and Figure 3(b) shows the flash memory block. If the data to be fetched are available in the MAM latches (c = true), no FLASH access is required. Otherwise (c = false), one FLASH access is required and, thus, the respective energy consumption must be computed. The function mamaccess returns a Boolean value: given the hit ratio of the application under evaluation, it first generates a uniformly distributed random number between 0 and 1 and then compares it to the MAM hit ratio; it returns true if the random number is less than or equal to the hit ratio, and false otherwise. Accesses to the FLASH memory stall the pipeline, introducing pipeline bubbles in the wake of the stalled instruction (see the output arc of transition f2). The bubbles pass through all stages of the pipeline like any other instruction and are then discarded in the last pipeline stage. The MAM miss ratio must be provided to define the evaluation scenario. To obtain this information, a simple trace-driven simulator was implemented to support the estimation of miss ratios for specific instruction patterns. This simulator receives as input a trace of the executed instruction addresses and reports the estimated MAM miss ratio.
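The mamaccess decision described above can be sketched as follows. This is an illustrative Python rendition (the original is a CPN ML function), and the function names and the energy value are hypothetical; the paper's actual values come from AMALGHMA measurements and the LPC2106 datasheet.

```python
import random

# Hypothetical per-access FLASH energy (nJ); stands in for the measured
# value added via the addEnergy function in the CPN model.
FLASH_ACCESS_ENERGY_NJ = 10.0

def mamaccess(hit_ratio):
    """Return True when the fetch hits the MAM latches (no FLASH access).

    Draws a uniform random number in [0, 1) and compares it to the
    MAM hit ratio of the application under evaluation.
    """
    return random.random() <= hit_ratio

def fetch_energy(hit_ratio):
    """Energy charged for one fetch: zero on a MAM hit, one FLASH
    access worth of energy on a miss."""
    if mamaccess(hit_ratio):
        return 0.0
    return FLASH_ACCESS_ENERGY_NJ
```

With a hit ratio of 1.0 every fetch hits the MAM latches and no FLASH energy is charged; lower ratios charge the FLASH access energy proportionally often.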
The second difference is that there are additional transitions in the fetch/decode block that are responsible for exchanging the instructions in the fetch and decode stages for bubbles. These transitions become enabled when place control receives a token with colour flush, generated by the execute block when it simulates a branch instruction.
The LPC2106 instruction set has been divided into six classes of instructions according to their performance and energy consumption characteristics: load, store, conditional branch, unconditional branch, data operations, and multiply. For each instruction class, the execute block defines the next states and what should be done on the way from one state to another. Figure 4 shows an excerpt of the execute block. As can be seen, depending on its class, an instruction may take one of the paths described in the model, and the corresponding delay and energy consumption are computed. At the decode stage, the dec function (see Figure 1) classifies instructions into one of the instruction classes. This function returns a value of type INSTRUCTION in a probabilistic way, such that if an instruction of class c1 is executed with a frequency of 50% in the code to be evaluated, this function returns an INSTRUCTION value of class c1 with a probability of 50%.
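The probabilistic behavior of the dec function can be sketched as a weighted random draw over the six instruction classes. The Python below is an illustrative stand-in (the original is a CPN ML function), and the class frequencies are hypothetical values of the kind the workload characterization of Section 4.2 would produce.

```python
import random

# Hypothetical execution frequencies per instruction class, as would be
# produced by the workload-characterization step (Section 4.2).
CLASS_FREQUENCIES = {
    "load": 0.20, "store": 0.10, "cond_branch": 0.10,
    "uncond_branch": 0.05, "data_op": 0.50, "multiply": 0.05,
}

def dec():
    """Probabilistic stand-in for the CPN dec function: returns an
    instruction class with probability equal to its execution
    frequency in the evaluated code."""
    classes = list(CLASS_FREQUENCIES)
    weights = list(CLASS_FREQUENCIES.values())
    return random.choices(classes, weights=weights, k=1)[0]
```

Over a long simulation run, the empirical mix of returned classes converges to the specified frequencies, which is what lets the CPN model reproduce the application's instruction profile without executing the code itself.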
4.2. Workload Specification
As stated earlier, the dec function generates instructions according to the frequency with which each instruction class is executed in the application under evaluation. Since this frequency distribution depends on the given software and its input data, we devised a method for capturing this information. The method consists of mapping the annotated application code into a DTMC. More specifically, the Control Flow Graph (CFG) of the application is mapped into an irreducible DTMC.
Each basic block $b_i$ in the CFG is mapped into a state $s_i$ in the DTMC. Similarly, control flow edges are mapped into transitions between states and are labeled with the state transition probabilities $p_{ij}$, where $p_{ij}$ defines the probability of executing $b_j$ after $b_i$. Such probabilities are obtained from annotations in the application code.
Figure 5(a) shows an example of annotated code, in which the annotations are comments. In this example, the annotation at line 4 indicates that the conditional expression at that line evaluates to true with a probability of 50%. The annotation at line 6 indicates that the iterative structure is executed 9 times. The values for the annotations may be captured, for instance, from (i) ad hoc designer knowledge, (ii) a more abstract system model, and/or (iii) extensive profiling. Several execution scenarios can be evaluated by simply changing these values. Figure 5(b) depicts the resulting DTMC, where the reader should note an additional transition from state 5 to state 1 (i.e., from the final to the starting point of the application), which is added to make the DTMC irreducible.
The objective of this mapping is to compute the average number of times each basic block in the CFG executes (its visiting number). Let $\pi$ be the stationary probability vector of the mapped DTMC, obtained numerically with the SHARPE tool, and let $V$ be the vector with the average number of executions of each basic block, where state $s_N$ contains the ending point of the application. Then $V$ is determined by (3):
$$V_i = \frac{\pi_i}{\pi_N}, \quad i = 1, \ldots, N. \qquad (3)$$
Given the average number of times each basic block executes, the frequency with which each instruction is executed can be obtained, and hence the execution frequency of each class.
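The visiting-number computation can be sketched on a small example. In the paper the stationary vector is computed by SHARPE; here, for illustration only, plain power iteration is used on a hypothetical 3-state chain (not the one of Figure 5), made irreducible by the added edge from the final state back to the start.

```python
# Hypothetical irreducible DTMC: row-stochastic transition matrix.
P = [
    [0.0, 1.0, 0.0],   # s1 -> s2
    [0.5, 0.0, 0.5],   # s2 -> s1 (loop back) or s2 -> s3
    [1.0, 0.0, 0.0],   # s3 -> s1 (added return edge: final -> start)
]

def stationary(P, iters=10_000):
    """Power iteration: repeatedly apply pi <- pi * P until convergence."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = stationary(P)

# Visiting number of each block per program run: V_i = pi_i / pi_N,
# where N indexes the ending state (here the last one).
V = [p / pi[-1] for p in pi]
```

For this chain the stationary vector is (0.4, 0.4, 0.2), so the visiting numbers are V = (2, 2, 1): blocks 1 and 2 execute twice per run on average, and the final block exactly once, matching the normalization $V_N = 1$.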
The methodology flow for estimating the energy consumption and execution time of a given application on an architecture is shown in Figure 6. The architectural model is constructed by composing CPN building blocks (right side of Figure 6). The building blocks represent functional units of the architecture under evaluation and are modeled at a high abstraction level, allowing flexibility, reuse, and rapid evaluation. These building blocks are annotated with values regarding the energy consumption (addEnergy function) and performance (CPN delays) of the modeled functional unit. Next, the code that will execute on the embedded platform is mapped onto the architectural model by a compiler (see Section 4.5). Finally, the model evaluation is performed by means of stochastic simulation (Section 4.3).
4.3. Evaluation
The evaluation is performed by means of simulation. The facilities of CPN Tools have been adopted to define analysis functions and to perform data collection. Two performance metrics were defined: (i) the average execution time per instruction and (ii) the average energy consumption per instruction. Given these metrics, the number of executed instructions in the application, and the processor's operating frequency, the overall energy consumption and execution time of an application are obtained.
Firstly, a breakpoint monitor [8] was defined and assigned to the last transition in the execute block; this transition fires for every instruction class. The breakpoint monitor collects data and tests whether the metrics satisfy the stop criterion. If so, the simulation stops; otherwise, it continues. To calculate the metrics, two values are collected on each firing of the transition linked with the breakpoint monitor: (i) the inter-firing time, that is, the current time minus the time of the last firing, and (ii) the interval energy consumption, that is, the current global energy consumption minus the global energy consumption at the last firing. We designed a set of statistical functions so that a confidence interval for the metrics could be constructed. The stop criterion states that the simulation halts once the confidence intervals of these two metrics satisfy the specified precision. The precision is specified by two parameters: (i) the confidence level and (ii) the relative error. This work adopted a confidence level of 95% and a maximum relative error of 2%.
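The stop criterion above can be sketched as follows: the simulation halts once the 95% confidence interval half-width is within 2% of the running sample mean. This Python sketch uses a normal-quantile interval; the function name and sample layout are illustrative, not the actual CPN Tools monitor code.

```python
import math
import statistics

Z_95 = 1.96           # standard normal quantile for a 95% confidence level
MAX_REL_ERROR = 0.02  # maximum relative error adopted in this work

def stop_criterion_met(samples):
    """True when the half-width of the 95% confidence interval for the
    mean of `samples` is at most 2% of the sample mean."""
    if len(samples) < 2:
        return False
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    half_width = Z_95 * sem
    return half_width <= MAX_REL_ERROR * abs(mean)
```

The same check is applied to both collected metrics (inter-firing time and interval energy consumption); simulation continues until both satisfy the precision.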
4.4. Measuring Strategy
This section describes the measuring method adopted to obtain the energy consumption and execution time values employed in the proposed models. To capture the average energy consumption of each functional unit defined in the model, assembly programs that separately stimulate the respective functional unit of the LPC2106 were implemented, uploaded to the platform, executed, and measured, and the obtained data were then statistically analyzed. For example, to capture the average power consumption when a MAM miss occurs, an assembly program that forces MAM misses was designed.
The AMALGHMA (Advanced Measurement Algorithms for Hardware Architectures) tool has been implemented to automate the measuring activities. AMALGHMA adopts a set of statistical methods, such as bootstrap and parametric methods, which are important in the measurement process due to several factors, for instance, (i) oscilloscope resolution and (ii) resistor error. Moreover, the results estimated by AMALGHMA were compared against and validated using the LPC2106 datasheet as well as the ARM7TDMI-S reference manual.
The measurement scheme is shown in Figure 7. To measure power consumption, a workstation executing the AMALGHMA tool is connected to an Agilent DSO03202A oscilloscope, which captures the current drained by the platform by measuring the voltage drop across a 1 Ohm sense resistor (the average microcontroller impedance is orders of magnitude higher than this). The oscilloscope is also connected to an I/O port of the LPC2106, which is used to monitor the code's start and end times; from these, the code's execution time is also estimated. Even for software functions of very short duration, the AMALGHMA tool is able to estimate the average execution time, the energy consumption, and other related statistics. To do so, sampling and statistical strategies have been implemented [21].
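The basic arithmetic behind the sense-resistor setup can be illustrated as follows. With a 1 Ohm resistor, the drained current in amperes equals the measured voltage drop in volts; the supply voltage, sampling period, and sample values below are hypothetical, and AMALGHMA additionally applies statistical methods (e.g., bootstrap) to the captured traces.

```python
R_SENSE_OHM = 1.0       # sense resistor from Figure 7
V_SUPPLY = 3.3          # supply voltage in volts (hypothetical)
SAMPLE_PERIOD_S = 1e-6  # oscilloscope sampling period (hypothetical)

def energy_joules(voltage_drops):
    """Integrate instantaneous power over the captured voltage-drop
    samples: I = V_drop / R, and the device sees V_SUPPLY - V_drop."""
    total = 0.0
    for v_drop in voltage_drops:
        current = v_drop / R_SENSE_OHM
        power = (V_SUPPLY - v_drop) * current
        total += power * SAMPLE_PERIOD_S
    return total
```

Dividing the integrated energy by the code's execution time (taken from the I/O-port markers) gives the average power drawn during the run.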
4.5. PECES Tool
An additional contribution of this work is the development of a computational tool that automates some steps of the proposed methodology. The tool is named PECES (Performance and Energy Consumption Evaluation of Embedded Systems). It receives the annotated source code and the architecture model as input and returns the average execution time and energy consumption as output.
The following steps are performed by PECES to evaluate a code.
(1) It compiles the application source code using the option to generate intermediate assembly code. GCC (arm-uclibc-gcc [22]) has been adopted as the compiler.
(2) PECES builds the Control Flow Graph (CFG) using the intermediate code generated in the previous step.
(3) It uses the CFG and the annotations from the source code to generate the corresponding irreducible DTMC.
(4) The DTMC is numerically evaluated in SHARPE so as to obtain the stationary probabilities.
(5) It uses the stationary probabilities to calculate the average number of executions of each basic block and, then, the number of times each instruction is executed. Next, PECES clusters instructions of the same class and calculates the frequency with which each class executes.
(6) The frequency distribution is written into the architecture model.
(7) PECES invokes the Access/CPN tool [23] to simulate the architecture model.
(8) Finally, the tool uses the average execution time per instruction and the average energy consumption per instruction obtained from the previous step to calculate the application's average execution time and energy consumption.
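The final scaling step can be sketched as simple arithmetic: the per-instruction averages produced by the simulation are multiplied by the number of executed instructions, with the time converted from cycles using the operating frequency. All numbers below are hypothetical placeholders for values that PECES would obtain from steps (5) and (7).

```python
AVG_CYCLES_PER_INSTR = 1.5     # from simulation (cycles, hypothetical)
AVG_ENERGY_PER_INSTR_NJ = 2.0  # from simulation (nJ, hypothetical)
N_INSTRUCTIONS = 1_000_000     # executed instructions (hypothetical)
F_CLOCK_HZ = 60e6              # operating frequency (hypothetical)

# Total execution time: instructions * cycles-per-instruction / frequency.
exec_time_s = N_INSTRUCTIONS * AVG_CYCLES_PER_INSTR / F_CLOCK_HZ

# Total energy: instructions * energy-per-instruction, converted nJ -> mJ.
energy_mj = N_INSTRUCTIONS * AVG_ENERGY_PER_INSTR_NJ * 1e-6
```

With these placeholder values the application would run for 25 ms and consume 2 mJ; in PECES the same computation uses the simulated averages and the instruction count derived from the visiting numbers.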