Energy-aware memory management for embedded multidimensional signal processing applications
 Florin Balasa^{1},
 Noha Abuaesh^{1},
 Cristian V. Gingu^{2},
 Ilie I. Luican^{3} and
 Hongwei Zhu^{4}
https://doi.org/10.1186/s13639-016-0043-9
© Balasa et al. 2016
Received: 22 September 2015
Accepted: 9 July 2016
Published: 25 July 2016
Abstract
In real-time data-intensive multimedia processing applications, data transfer and storage significantly influence, if not dominate, all the major cost parameters of the design space—namely energy consumption, performance, and chip area. This paper presents an electronic design automation (EDA) methodology for the high-level design of hierarchical memory architectures in embedded data-intensive applications, mainly in the area of multidimensional signal processing. Different from previous works, the problems of data assignment to the memory layers, of mapping the signals into the physical memories, and of banking the on-chip memory are addressed in a consistent way, based on the same formal model. This memory management framework employs techniques specific to dependence analysis based on integral polyhedra. The main design target is the reduction of the static and dynamic energy consumption in the hierarchical memory subsystem.
Keywords
Memory management; Multidimensional signals; Signal-to-memory mapping; Scratchpad memory banking; Polytopes and lattices

1 Introduction
In embedded real-time communication and multimedia processing applications, the manipulation of large data sets has a major effect on both the power consumption and the performance of the system. Due to the significant amount of data transfers between the processing units and the large and energy-consuming off-chip memories, these applications are often called data-dominated or data-intensive [1].
At the system level, the power cost can be reduced (and, at the same time, the system performance enhanced) by introducing an optimized custom memory hierarchy [2]. Hierarchical memory organizations reduce energy consumption by assigning the frequently accessed data to the low hierarchy levels [3], diminishing the dynamic energy consumption, which grows with the number of memory accesses. Moreover, they reduce the static energy consumption as well, since static energy decreases monotonically with the memory size [4].
Within a given memory hierarchy level, power can be reduced by memory partitioning, whose principle is to divide the address space into several smaller blocks and to map these blocks to physical memory banks that can be independently enabled and disabled [5, 6].
The most typical implementation of memory hierarchies makes use of caches. While extremely versatile and fast, caches are not always the best choice in embedded systems. As on-chip storage, the scratchpad memories (SPMs)—compiler-controlled static random-access memories (SRAMs), more energy-efficient than the hardware-managed caches—are widely used in embedded systems, where caches incur a significant penalty in aspects like area cost, energy consumption, hit latency, and real-time predictability [3].
The research on the assignment of signals (multidimensional arrays) to the memory layers [7] focused in part on how to restructure the application code to make better use of the available memory hierarchy [8]. Brockmeyer et al. used the steering heuristic of assigning the arrays having the highest access number over size ratio to the cheapest memory layer, followed by incremental reassignments [9]. Their model takes into account the relative lifetime differences between arrays and between the scalars covered by each array. However, their model operates on entire arrays, not taking into account that the access patterns are, in general, not uniform.
There are rather few research works addressing the problem of signal mapping to the physical memory. De Greef et al. mapped each multidimensional array from the behavioral specification by choosing the canonical linearization which yielded the minimum distance (in memory words) between array elements simultaneously alive [10].
Instead of a linear mapping, Tronçon et al. proposed to compute an m-dimensional mapping window for each m-dimensional array [11]: the sides of a window were computed based on the maximal index difference in each dimension between array elements simultaneously alive. (The bounding-window mapping is also used in PPCG—a source-to-source compiler using polyhedral compilation techniques that extracts data parallelism with a code generator for a modern graphics processing unit [12]). Darte et al. proposed a lattice-based mathematical framework for array mapping, establishing a correspondence between valid linear storage allocations and integer lattices called strictly admissible [13].
Partitioning of on-chip memories has been analyzed by several research teams, being typically used as an additional dimension of the memory design space. Shiue and Chakrabarti studied power-efficient partitioned cache organizations, identifying cache sub-banking as an effective approach to reduce cache power consumption [14]. Benini et al. proposed a recursive partitioning of the SPM address space, which achieved a complete exploration of the banking solutions [5]. In [6], the cost function was shown to exhibit properties that allow the application of a dynamic programming paradigm. A leakage-aware approach, based on traces of memory accesses, takes into account that putting a memory block into the dormant state should be done only if the cost of the energy overhead and the decrease of performance can be amortized [15, 16].
The advances in data-dependence analysis [17] and optimizing compilers [18] have influenced the development of memory management techniques based on the processing and restructuring of behavioral specifications. Ramanujam et al. use data dependence and data reuse to estimate the minimum amount of memory in signal processing codes and, then, reduce the storage requirement through loop-level transformations [19]. However, their approach focuses only on nested loops, and the window sizes for the arrays are determined using only a single linearization—the one induced by the variation of the iterators in the nested loop. De La Luz et al. present a strategy of increasing the memory bank idleness by modifying the execution order of loop iterations [20]. The number of banks and their sizes seem predetermined, though, and it is not clear what happens when the arrays are large, exceeding the size of the banks.

The contributions of this memory management framework are the following:

The data assignment to the memory layers identifies the intensely accessed parts of the multidimensional arrays, steering these parts to the energy-efficient storage layer. Such a strategy is thus independent of the number and size of the arrays (depending only on the size of the energy-efficient layer) and entails a significant reduction of energy consumption in the memory subsystem.

The signal-to-memory mapping is designed to work in hierarchical memory organizations, being able to operate with parts of arrays (rather than entire arrays). It can provide mapping functions useful for the design of the address generation units, and it can evaluate metrics of quality, such as the minimum storage requirement of the behavioral specification.

Two memory banking techniques are implemented in our framework: they further reduce the memory energy consumption, computing near-optimal banking solutions very fast even when the memory address space is large.
The main input of this memory management framework is the behavioral specification of a data-intensive application. Such a specification is described in a high-level programming language, where the code is typically organized in sequences of loop nests. The loop boundaries are linear functions of the outer loop iterators. The data structures are multidimensional arrays—a characteristic of data-intensive applications [1]; the indices of the array references are linear functions of the loop iterators. The logical expressions in conditional instructions can be either simple or compound. The behavioral specifications describe the processing of streams of data samples: different from computer programs, these specifications can be imagined as surrounded by an implicit loop having time as iterator. This is why the code can contain delayed signals, i.e., signals produced (or inputs) in a previous data processing, which are used as operands during the current execution of the code (for instance, A[i][j]@3 means an array reference produced three time iterations in the past).
The rest of the paper is organized as follows. Section 2 discusses the problem of energy-aware signal assignment based on a case study. Section 3 presents the formal model of the methodology and the algorithm for data assignment to the memory layers. Section 4 describes a storage-efficient approach for mapping multidimensional arrays to the physical memories. Section 5 presents the algorithm for partitioning the on-chip SPMs. Section 6 discusses implementation aspects and presents experimental results. Finally, Section 7 summarizes the main conclusions of this research.
2 Signal assignment to the memory layers: a case study
Brockmeyer et al. proposed to assign the arrays having the highest access-number-over-size ratio to the on-chip memory layer [9]. Certainly, the array A has a high access ratio, equal to 8320.5 (since there are 545,292,288 accesses to 65,536 array elements): quite obviously, the most desirable scenario, from the point of view of both energy consumption and performance, is to store all the signals from the behavioral specification in the SPM memory layer or, at least, the entire array A. This is usually not possible: quite often, the size of the on-chip memory is a design constraint, usually small relative to the storage requirement of the entire code.
Not only may the arrays from the behavioral specification have storage requirements greater than the SPM size, but their possibly non-uniform pattern of accesses is also not taken into account by this past assignment approach [9]. Hu et al. can use parts of arrays in the assignment to the memory layers [23]: their illustrative example suggests cuts along one of the array dimensions as the main partitioning heuristic. If this is the case, the approach has a similar shortcoming—the pattern of accesses may have significant variations along these cuts. For instance, in our test case, the A-elements of row 128 have a range of variation between 128 (for A[128][0]) and 33,025 accesses (for A[128][128]), with an abrupt increase from 8192 to 24,961 accesses for the neighboring elements A[128][63] and A[128][64] (see Fig. 5).
In data-intensive signal processing applications, the main data structures from the behavioral specification are multidimensional arrays. The problem is how to identify the intensely accessed parts of the arrays based on the analysis of the application code, in order to steer their assignment to the energy-efficient data storage layer—the on-chip SPM. Note that a simulated execution of the behavioral specification may be computationally expensive (e.g., when the number of array elements is very large, or when the application code contains deep loop nests); at the same time, such a scalar-oriented technique yields assignment results that cannot be directly used for the design of the address generation units [24].
An assignment algorithm mapping the array elements from the behavioral specification to the memory layers, targeting the reduction of the energy consumption in the hierarchical memory subsystem, will be described in the next section.
3 Data assignment to the memory layers for the reduction of energy consumption in the hierarchical memory subsystem
Our technique is based on a simple observation: the most intensely accessed parts of the array space of a multidimensional signal are typically covered by more than one array reference. Actually, in many cases, the more array references cover a certain element, the more accessed that element is. For instance, the most heavily accessed parts of array A (see Fig. 5) from the code in Fig. 4 are the A-elements belonging to both array references A[i][j] and A[k][l]. Of course, the intensity of memory accesses to them is not uniform but, nevertheless, they are read more often than the other A-elements. In order to find out which array elements belong to several array references, we must intersect the array references of the signal. This operation is done based on an algebraic model whose principle is briefly explained below.
Note that, from a geometrical point of view, the set of inequalities (1) represents an integral polytope—a bounded and closed multidimensional polyhedron, restricted to the points having integer coordinates. Checking the existence of integer solutions of a linear system of inequalities is a well-known problem [27].
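The integer-feasibility check can be illustrated with a small sketch. The function below is only a brute-force enumeration over a bounding box (production dependence analyzers use Fourier-Motzkin elimination or Omega-style integer tests instead); the encoding A·x ≤ b and the example triangle are our own illustrations, not taken from the paper.

```python
import itertools

def has_integer_point(A, b, bounds):
    """Brute-force check whether the polytope {x : A.x <= b} contains an
    integer point, enumerating the integer grid inside `bounds`
    (one (lo, hi) pair per dimension)."""
    for x in itertools.product(*[range(lo, hi + 1) for lo, hi in bounds]):
        if all(sum(a * xi for a, xi in zip(row, x)) <= bi
               for row, bi in zip(A, b)):
            return True
    return False

# Triangle 0 <= i <= j <= 4, encoded as -i <= 0, i - j <= 0, j <= 4:
feasible = has_integer_point([[-1, 0], [1, -1], [0, 1]], [0, 0, 4],
                             [(0, 4), (0, 4)])  # -> True
```

Actual polyhedral tools avoid the exponential enumeration, but the sketch conveys what "existence of integer solutions" means for a lattice intersection.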
The intersection of lattices described above can be used to decompose the array space of every multidimensional signal from the application code into disjoint lattices and, also, to compute the number of accesses to the array elements in every partition. A high-level pseudocode is given below: the decomposition is obtained by recursively intersecting all the array references of a selected signal. The structure of each array reference in terms of component lattices is determined by gradually building a directed acyclic graph (DAG)—each node representing a linearly bounded lattice (LBL) and each arc denoting an inclusion relation between the respective sets. Initially, this DAG is just a set of nodes, one per array reference in the code. Gradually, new nodes emerge due to intersections between lattices, and arcs (inclusions) are added between the nodes.
The difference of lattices is a more complex operation, described, for instance, in [22]—where it is used to compute the minimum data storage of behavioral specifications.
The benefit of the decomposition of the array space for each signal is that it yields access information for steering the signal assignment to the memory layers. The obvious candidates for being stored on-chip are the regions of the array space (LBLs) having the highest ratios between the number of array accesses and their size. Note that Brockmeyer et al. considered similar ratios, but at the level of whole arrays [9], whereas our approach localizes the heavily accessed regions in the array space and applies the ratios at the level of these regions. In our illustrative example, the middle region M has the highest ratio: 425,218,048 / 16,384 = 25,953.25 (that is, an average of almost 26 thousand accesses per array element). We are using an even more precise metric—the savings of energy, as a percentage, when the lattice is stored on-chip rather than in the external DRAM. According to this metric, the energy benefit of lattice M is 60.88 % (the computation will be explained below).
Therefore, storing on-chip the elements in the center of signal A's array space would maximize the benefit in terms of energy reduction. However, this central region M of the array space requires 16 Kbytes: what is to be done if there is a design constraint limiting the SPM storage to less than 16 Kbytes?
The number of memory accesses for each of A's 128 smaller lattices (see Fig. 8) can also be computed: then, the ratios between these numbers and the lattice sizes (of 128 bytes each) decrease from 28,993 (for lattices 127 and 128) to 22,913.5 for the two lateral ones, 64 and 191. Hence, it is more beneficial to store in the SPM the lattices going from the middle to the periphery of the central region of the array space.
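As an illustration of this ratio-driven selection, the following sketch ranks disjoint lattices by their accesses-per-byte ratio and fills the SPM greedily until the capacity budget is exhausted. It is a simplification of the actual assignment algorithm; the lattice names are hypothetical, and the access counts correspond to the ratios 28,993 and 22,913.5 quoted above for 128-byte lattices.

```python
def select_for_spm(lattices, spm_capacity):
    """Greedy sketch of ratio-driven assignment: rank disjoint lattices by
    their accesses-per-byte ratio and fill the SPM until the capacity
    budget is exhausted.
    `lattices` is a list of (name, size_bytes, accesses) tuples."""
    chosen, used = [], 0
    ranked = sorted(lattices, key=lambda t: t[2] / t[1], reverse=True)
    for name, size, accesses in ranked:
        if used + size <= spm_capacity:
            chosen.append(name)
            used += size
    return chosen, used

# Lattice 127 has ratio 3,711,104 / 128 = 28,993; the lateral lattices
# have ratio 2,932,928 / 128 = 22,913.5 (values echoing Section 3):
lats = [("L64", 128, 2_932_928), ("L127", 128, 3_711_104),
        ("L191", 128, 2_932_928)]
```

With a 256-byte budget, `select_for_spm(lats, 256)` picks the central lattice first, matching the middle-to-periphery order described above.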
When comparing time and energy per access in a memory hierarchy, it may be observed that these two metrics often have similar behavior; namely, they both increase as we move from low to high hierarchy levels. While it sometimes happens that a low-latency memory architecture is also a low-power one, optimizing memory performance does not imply power optimization, or vice versa [14] (although architectural solutions originally devised for performance optimization can be beneficial in terms of energy consumption as well). There are two basic reasons for this: first, energy consumption and performance do not increase in the same way with memory size and hierarchy level; second, performance is a worst-case metric, while power is an average-case metric: for instance, the removal of a critical computation that improves performance may be harmful in terms of power consumption.
Algorithm 2 could be extended to an arbitrary number of memory layers if the functions of energy per access and static power versus memory size were available for each layer. Assuming these functions increase monotonically with the memory size for each layer, and that the value intervals of these functions are disjoint and increase with the hierarchy level, the algorithm can be modified to assign the lattices of larger benefits starting from the lowest level and gradually moving to the higher levels of hierarchy. Our current implementation is dependent on the limitations of CACTI 6.5—the analytical tool used to provide memory information [28].
4 Mapping signals into the physical memory
This design phase decides the memory addresses of the signals from the behavioral specification. The signal-to-memory mapping has the following goals: (a) to map the signals (already assigned to the memory layers) into amounts of data storage as small as possible; (b) to guarantee that scalar signals (array elements) simultaneously alive are mapped to distinct storage locations; and (c) to use mapping functions simple enough to ensure an address generation hardware of a reasonable complexity.
Different from the previous works [10, 11, 13], this mapping technique is designed to work in hierarchical memory organizations, since it operates with parts of arrays (represented by mutually disjoint lattices) that can be assigned to different physical memories. The polyhedral framework, common to all the design phases in our system (data assignment to the memory layers, signal/array mapping onto the external memory and the SPM, followed by the banking of the latter), entails a high computational efficiency, since all the phases rely on similar polyhedral operations. We present below the basic ideas of the mapping approach.
For an m-dimensional array, there are m! orderings of the indices. For instance, a 2D array can be typically linearized by concatenating the rows or by concatenating the columns. In addition, the elements in a given dimension can be mapped in the increasing or decreasing order of the respective index. All these 2^{m}·m! possible linearizations are called canonical [10]. For any canonical linearization, we compute for every linearly bounded lattice the largest distance (in memory words) between any two live lattice elements during the code execution. Based on these results, we compute—for every canonical linearization—the largest distance between any two live array elements at any time during the code execution.^{4} This distance plus 1 is then the size of the storage window required for the mapping of the array into the data memory. More formally, \(W_{A} = \min \max \{dist(A_{i}, A_{j})\} + 1\), where \(W_{A}\) is the size of the storage window of a signal A, the minimum is taken over all the canonical linearizations, and the maximum is taken over all the pairs of A-elements \((A_{i}, A_{j})\) simultaneously alive. Even when parts of the array are stored in the SPM and the rest of it in the off-chip memory, the sizes of the storage windows can still be computed, since the assignment of data to the memory layers is done at the level of lattices (as explained in Section 3).
By analyzing the canonical linearizations, we try to reduce the memory window even more. This analysis is based on the evaluation of the distance between the minimum and maximum index vectors, relative to the lexicographic order, in a minimal bounding window of the index space (the computation steps being described and illustrated in [30]). In Fig. 11b, these minimum and maximum index vectors are represented by the points M and N, and the distance between them is \(dist(M,N) = (11-2) \times 5 + (7-3) = 49\). Assuming that all the array elements within a linearly bounded lattice are alive, in a canonical linearization, the maximum distance in words between the array elements is the distance between the (lexicographically) minimum and maximum index vectors, provided an index permutation is applied first (in particular, an index interchange for 2D signals). If in the canonical linearization some dimension is traversed backwards, then a simple transformation reversing the index variation must also be applied. In our example, the interchange of the indices in Fig. 11c does not reduce the distance between the points representing the minimum and maximum index vectors, but the reversal of the first index variation—as shown in Fig. 11d—entails a distance reduction: \(dist(M,N) = (11-2) \times 5 + (5-6) = 44\). It follows that the array reference can be stored without mapping conflicts in a memory window \(W_{A}\) of 45 words: it suffices that any read/write access to A[i][j] be redirected to the memory word, say, \(W_{A}[(5 \times (13-i)+j) \bmod 45]\). To be sure, 45 words represent an excess of storage relative to the minimum storage requirement of 38 words, but the advantage is that there is an easy-to-design function directing the mapping from the index space to the data storage.
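The search over canonical linearizations can be sketched as follows. The routine enumerates all \(2^{m} \cdot m!\) index permutations and traversal directions for a small array and returns the smallest window \(W_{A}\); the liveness sets are supplied explicitly here, whereas the actual framework derives them from the polyhedral model, so this is an illustration of the cost function rather than the paper's implementation.

```python
import itertools

def window_size(live_sets, dims):
    """Smallest storage window over all canonical linearizations.
    `live_sets` is a list of sets of index tuples simultaneously alive;
    `dims` gives the array extent in each dimension."""
    best = None
    for perm in itertools.permutations(range(len(dims))):
        for rev in itertools.product((False, True), repeat=len(dims)):
            def addr(idx):
                # Mixed-radix address for this permutation/direction choice.
                a = 0
                for d in perm:
                    x = dims[d] - 1 - idx[d] if rev[d] else idx[d]
                    a = a * dims[d] + x
                return a
            worst = max((abs(addr(p) - addr(q))
                         for live in live_sets
                         for p, q in itertools.combinations(live, 2)),
                        default=0)
            best = worst + 1 if best is None else min(best, worst + 1)
    return best
```

For instance, if only the column neighbors (0,0) and (1,0) of a 2x3 array are ever alive together, the column-major linearization places them one word apart, so the window is 2 words.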
Figure 11e shows another code example, where the array elements produced by the array reference A[i][j] are consumed by the array reference A[i−3][j−2]. The points to the left of the dashed line represent the iterator vectors of the elements produced until the breakpoint indicated in the code: the black points represent the elements still alive (i.e., produced and still used as operands in the next iterations), while the circles represent A-elements already “dead” (i.e., not needed as operands any more). The light grey points to the right of the dashed line represent the index vectors of A-elements still unborn (to be produced in the next iterations). There is a canonical linearization in which the distance between the index vectors of simultaneously alive elements is 17 (which entails a memory window of 18 words), very close to the minimal storage requirement of 17 words. □
The computations of distances are performed for each disjoint lattice extracted from the code [30]. The overall mapping results are then assembled, taking into account the lifetimes of the lattices, as well as the lifetimes of the array elements they contain.
In order to avoid the inconvenience of analyzing different linearization schemes (whose number grows fast with the signal’s dimension), we also use a second mapping technique based on integer projections: although it often yields slightly worse storage results than the linearization approach, it has the advantage of being faster.
We compute a maximal m-dimensional bounding box \(BB_{A}=(w_{1}, \ldots, w_{m})\), large enough to encompass at any time during the code execution the simultaneously alive (m-dimensional) A-elements. As already mentioned in Section 1, this bounding-box technique was also used in PPCG—a polyhedral parallel code generator for CUDA [12]. An access to the element \(A[index_{1}] \ldots [index_{m}]\) can then be redirected without any conflict to the bounding box element \(BB_{A}[index_{1} \bmod w_{1}] \ldots [index_{m} \bmod w_{m}]\).
Each window side \(w_{k}\) is computed as the maximum difference in absolute value between the k-th indices of any two A-elements \((A_{i}, A_{j})\) simultaneously alive, plus 1. More formally, \(w_{k} = \max \{|x_{k}(A_{i}) - x_{k}(A_{j})|\} + 1\), for \(k=1, \ldots, m\). This ensures that any two array elements simultaneously alive are mapped to distinct memory locations. Then, the bounding box \(BB_{A}\) can be mapped one-to-one to a memory window \(W_{A}\). The amount of data memory required for storing the array is the volume of the bounding box \(BB_{A}\), that is, \(W_{A}\ =\ \Pi ^{m}_{k=1} w_{k}\).
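A minimal sketch of this bounding-box mapping, assuming the sets of simultaneously alive index vectors are given explicitly (the framework itself derives them from the lattices):

```python
def bounding_box(live_sets, m):
    """Compute the window sides w_k = max |x_k(Ai) - x_k(Aj)| + 1 over all
    pairs of elements simultaneously alive.
    `live_sets` is a list of sets of m-dimensional index tuples."""
    w = [1] * m
    for live in live_sets:
        for k in range(m):
            coords = [idx[k] for idx in live]
            w[k] = max(w[k], max(coords) - min(coords) + 1)
    return w

def redirect(index, w):
    """Redirect an access A[index_1]...[index_m] into the bounding box by
    taking each index modulo the corresponding window side."""
    return tuple(i % wk for i, wk in zip(index, w))
```

Two elements alive together can never collide: their k-th indices differ by less than \(w_{k}\), so at least one modulo result differs.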
This mapping approach can be independently applied to each memory layer, providing mapping functions for all the signals in the specification and a complete storage allocation/assignment solution for distributed memory organizations. In addition, it can generate the traces of memory accesses for each memory layer, the trace to the SPM being particularly useful for energy-aware memory banking (see the next section). Our memory management software also computes the minimum storage requirement of each multidimensional signal in the specification [22] (therefore, the optimal memory sharing between the elements of each array), as well as the minimum data storage for the entire algorithmic specification—therefore, the optimal memory sharing between all the array elements and scalars in the code. These lower bounds are used as metrics of quality for the mapping solution, since they show how much larger the mapping windows are versus the minimum storage requirements: no prior technique provides such metrics of quality for its mapping solutions.
5 Scratchpad memory banking for the reduction of energy consumption
The first two arguments of \(E_{2}^{dyn}\) are the start addresses in words of the two banks, the third being the total size.
The static energy consumed in the two-bank SPM, having the address space partitioned as above, is the sum of the static energies in each bank: \(E_{2}^{st}(0,k,N)=E_{1}^{st}(0,k)+E_{1}^{st}(k,N-k)\). Neither term depends on the number of memory accesses.
The solution space of two-way memory banking can be exhaustively explored (and, hence, optimally solved from the energy point of view) by iteratively moving the upper bound k of the first bank from 1 to N−1, and finding the global minimum: \(\min _{k}{\{E_{2}^{dyn}(0,k,N)+E_{2}^{st}(0,k,N)\}}+\Delta E_{12}\).
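The exhaustive two-way exploration can be sketched as below. The energy models `dyn_energy` and `st_energy` are illustrative stand-ins for the CACTI-derived functions used in the paper (here, with one access count per word address, and per-access dynamic energy depending only on the bank size).

```python
def best_two_way_split(access_counts, dyn_energy, st_energy, delta_e12):
    """Try every boundary k of the first bank, charge each bank its dynamic
    energy (its accesses times the per-access cost for its size) plus its
    static energy, add the two-bank overhead, and keep the minimum."""
    n = len(access_counts)  # total SPM size in words
    best_k, best_e = None, float("inf")
    for k in range(1, n):
        e = (sum(access_counts[:k]) * dyn_energy(k)
             + sum(access_counts[k:]) * dyn_energy(n - k)
             + st_energy(k) + st_energy(n - k)
             + delta_e12)
        if e < best_e:
            best_k, best_e = k, e
    return best_k, best_e
```

With accesses concentrated in the first half of the address space, the boundary settles where the hot addresses fall into a small, cheap bank.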
A similar cost metric can be used to explore multi-way banking solutions: any possible partition into M (≥2) banks is defined by a set of M−1 addresses identifying the memory bank boundaries. Based on this idea, Benini et al. implemented a recursive algorithm [5] where the solution space is exhaustively explored (their main input is the graph of the distribution of memory accesses over the SPM address space, rather than the behavioral specification of the application). This search for an energetically optimal solution proves to be computationally expensive (see Section 6), even infeasible, for larger values of M—the maximum number of banks—and/or larger values of the SPM size. Angiolini et al. carried out a similar exploration more efficiently using dynamic programming [6]. Although the time complexity is polynomial, our experiments found that the running times of their method exhibit a fast increase with the sizes of the SPM and the execution trace (the computation time was over 8 h for our illustrative example in Fig. 4 and an SPM size of 8 Kbytes, when the memory word was 1 byte; for SPM sizes smaller than 2 Kbytes, though, the technique can be effective).
The banking algorithms we propose are consistent with our model of partitioning the array space of signals into disjoint lattices (see Section 3). For M<4, these algorithms are basically identical to the exploration algorithm presented in [5], since this approach yields optimal solutions. For M≥4, as the running times may be extremely large, we introduce a constraint that significantly reduces the exploration space: no SPM-assigned lattice can cross a bank boundary. This constraint ensures the effectiveness of our approach in terms of both speed and near-optimality of the results—as Example 4 will show.
5.1 Latticebased recursive algorithm
In addition to M, the maximum number of SPM banks, the inputs of the SPM partitioning algorithm are:
Input 1: An array \({\mathcal {A}}=[{addr}_{0}, {addr}_{1}, \ldots, {addr}_{n}]\) of ordered addresses, such that each linearly bounded lattice \(L_{k}\), k=1,…,n, assigned to the on-chip memory layer is mapped at the successive SPM addresses \(\{{addr}_{k-1}, \ldots, {addr}_{k}-1\}\).
Input 2: An array \({\mathcal {RW}}=[{rw}_{1}, \ldots, {rw}_{n}]\) whose elements represent the numbers of read/write accesses for each lattice mapped onto the SPM (these access counts are already known from Section 3).
Input 3: An array \({\mathcal {E}}=[\Delta E_{12}, \Delta E_{23}, \ldots, \Delta E_{M-1,M}]\) whose elements \(\Delta E_{k,k+1}\) are the energy overheads resulting from moving from an on-chip SPM with k banks to one with k+1 banks. The decoding circuitry was synthesized using the ECP family of FPGAs from Lattice Semiconductor [31] and, for the energy overheads, we used the power calculator from Lattice Diamond [31].
Output: The energeticallyoptimal SPM partitioning, i.e., an array of SPM addresses delimiting the banks, and the minimum value of the total (static and dynamic) energy consumption for this optimal SPM banking solution.
The algorithm starts from the monolithic architecture and searches for the energetically optimal partitioning of the SPM into no more than M memory banks, such that the borderlines between banks are addresses in the array \({\mathcal {A}}\) (hence, ensuring that any lattice of signals is entirely stored in one bank). A variable crtBestSolution records the set of addresses in \(\mathcal {A}\) corresponding to the most energetically efficient partition reached at any moment of the exploration; initially, the SPM being monolithic, this set is \(\{{addr}_{0}, {addr}_{n}\}\). A variable crtMinEnergy registers the total energy consumption of the best SPM banking solution encountered during the exploration. A function SPM_energy(bank_size, number_accesses) uses CACTI 6.5 [28] and the number of read/write accesses in order to compute the total energy (both static and dynamic) consumed in a bank of the specified size. A recursive function Multi_Bank, whose first formal parameter m (initially equal to 2) is the current number of banks, searches for the optimal solution such that the first bank ends at \({addr}_{k}\). This function is successively called for k=1, 2, …, n−1. EnergyConsumed registers the amount of energy consumed from the start of the SPM up to the borderline \({addr}_{k}\). If its value exceeds the best energy already recorded (crtMinEnergy), there is no need to continue the exploration, since all the next solutions will be energetically worse—due to the monotonic increase of the energy consumption with the SPM size.
Outputs: crtBestSolution – an ordered set of SPM addresses from \({\mathcal {A}}\) delimiting the banks, and the corresponding energy consumption crtMinEnergy.
The recursive function Multi_Bank searches for the best banking solution starting from a d d r _{ k } till the end of the SPM at a d d r _{ n }. From the beginning of the SPM (a d d r _{0}) till the address a d d r _{ k } there are already m−1 banks. At the beginning, the function considers [ a d d r _{ k }, a d d r _{ n }] as the mth bank of the SPM: if this banking configuration is energetically better than all previous solutions, it is duly recorded as the best solution reached during the exploration. If the maximum number of banks is not reached yet (m<M), then the function explores solutions with m+1 banks or more, considering the mth bank to be [a d d r _{ k }, a d d r _{ j }], for j=k+1, k+2, …, n−1.
A solution stack, together with typical stack functions (push, pop, top), is used to record and resume the partial banking solutions. For instance, the push instruction in the body of the Multi_Bank function takes the set of memory addresses on the top of the stack, adds the new element \({addr}_{k}\) to it, and pushes the new set back on the stack.
Since the energy cost is monotonically increasing with the SPM size, a backtracking mechanism is incorporated before the recursive call to prevent the search towards more energeticallyexpensive partitions. The output of the algorithm is an array of SPM addresses delimiting the banks, and the corresponding energy consumption.
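The recursion can be sketched as follows. This is a simplified re-implementation of the idea rather than the paper's exact Multi_Bank function: bank borders may only fall on the lattice boundaries in `addrs`, a partial solution is pruned once its energy reaches the best complete solution found so far, and `spm_energy(size, accesses)` is a stand-in for the CACTI-based cost function (plain function arguments replace the solution stack used in the framework).

```python
def multi_bank(addrs, rw, spm_energy, delta_e, M):
    """Lattice-constrained banking search (illustrative sketch).
    addrs: n+1 ordered lattice-boundary addresses; rw: n access counts,
    one per lattice; delta_e: overheads for going from k to k+1 banks;
    M: maximum number of banks. Returns (bank borders, total energy)."""
    n = len(addrs) - 1
    best = {"cut": [addrs[0], addrs[n]],
            "energy": spm_energy(addrs[n] - addrs[0], sum(rw))}

    def recurse(start, m, energy_so_far, cuts):
        # Option 1: close the SPM with bank m = [addrs[start], addrs[n]).
        total = (energy_so_far
                 + spm_energy(addrs[n] - addrs[start], sum(rw[start:n]))
                 + sum(delta_e[:m - 1]))  # overhead of an m-bank decoder
        if total < best["energy"]:
            best["energy"] = total
            best["cut"] = cuts + [addrs[n]]
        if m >= M:
            return
        # Option 2: end bank m at an intermediate lattice boundary.
        for j in range(start + 1, n):
            e_bank = spm_energy(addrs[j] - addrs[start], sum(rw[start:j]))
            if energy_so_far + e_bank >= best["energy"]:
                continue  # backtracking prune: cost only grows from here
            recurse(j, m + 1, energy_so_far + e_bank, cuts + [addrs[j]])

    recurse(0, 1, 0, [addrs[0]])
    return best["cut"], best["energy"]
```

With a toy cost model `spm_energy = lambda size, acc: acc * size + size`, isolating a heavily accessed lattice in its own small bank beats the monolithic configuration, while a large banking overhead in `delta_e` makes the monolithic SPM win.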
In addition, for each LBL in the decomposition of the array space, we compute the time intervals (in clock cycles) when the lattice is not accessed. This idleness analysis cannot be done directly in terms of time: it is first done in terms of loop iterators. For instance, we must determine the iterator vectors in the loop nests when a disjoint lattice is accessed for the first time and for the last time.^{5} Only afterwards do we compute the clock cycles during the code execution corresponding to those iterator vectors. When the recursive function Multi_Bank investigates the case when the m-th bank spans from addr_k to addr_n (the end of the SPM), the idleness intervals of the lattices L_{k+1}, …, L_n assigned to this m-th bank are intersected in order to determine whether there are idleness intervals at the bank level. If this is the case, the bank can be switched to the sleep state during the idleness intervals that are large enough. (A time overhead of one clock cycle for the transition from the sleep to the active state is also applied, in accordance with simulated data on caches reported in [32].) In order to overcome the energy overhead entailed by the transition of a memory bank from the active state into the sleep state and back to the active state, the bank must remain in the sleep state for at least a minimum number of clock cycles (otherwise, the static energy savings are smaller than the energy overhead of the transitions). This idleness threshold in cycles can be estimated; typical values are on the order of hundreds of cycles [16]. So, if the idleness of a bank (resulting from the intersection of the idleness intervals of the lattices assigned to the bank) exceeds the idleness threshold, the energy cost of the bank is computed taking into account the switches to the sleep state and back.
The idleness intervals of each lattice are organized into an interval tree [33], as the depth of this data structure is O(log n) for n intervals and typical interval operations have logarithmic complexity.
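As a hedged illustration of this bank-level idleness computation (the function names and the half-open interval representation are our own assumptions, and plain sorted lists stand in for the interval trees mentioned above), the intersection and threshold test might look like:

```python
# Sketch: decide when an SPM bank can be switched to the sleep state, given
# the idleness intervals (in clock cycles) of the lattices it stores.
# Names and the (start, end) half-open representation are illustrative.

def intersect_two(a, b):
    """Intersect two sorted lists of half-open idleness intervals."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            result.append((lo, hi))
        if a[i][1] < b[j][1]:    # advance the interval that ends first
            i += 1
        else:
            j += 1
    return result

def bank_sleep_intervals(lattice_idleness, threshold_cycles, wakeup_cycles=1):
    """Bank-level idleness: intersect the per-lattice intervals, keep only
    those long enough to amortize the sleep/active transitions, and shorten
    each by the one-cycle wake-up overhead."""
    common = lattice_idleness[0]
    for intervals in lattice_idleness[1:]:
        common = intersect_two(common, intervals)
    return [(lo, hi - wakeup_cycles)
            for (lo, hi) in common
            if hi - lo >= threshold_cycles]
```

For instance, two lattices idle during cycles [100, 900) and [300, 1200) share the window [300, 900); with a threshold of 200 cycles the bank can sleep there, waking one cycle before the next access.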
A highlevel pseudocode of the recursive function Multi_Bank is given below:
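The following compact Python sketch is our own rendering of that recursive search; the cost function spm_energy and the access bookkeeping are illustrative assumptions standing in for the CACTI-based SPM_energy:

```python
# Sketch of the recursive banking search with backtracking described above
# (our own rendering, not the authors' implementation).
#   addr[0..n]  : the ordered borderline addresses of the array A
#   accesses[i] : number of accesses to the lattice stored in [addr[i], addr[i+1])
#   spm_energy  : stand-in for the CACTI-based SPM_energy(bank_size, accesses)

def bank_search(addr, accesses, M, spm_energy):
    n = len(addr) - 1
    best = {"solution": [addr[0], addr[n]],          # monolithic SPM to start
            "energy": spm_energy(addr[n] - addr[0], sum(accesses))}
    partial = [addr[0]]                              # solution stack

    def bank_cost(i, k):                             # one bank [addr[i], addr[k))
        return spm_energy(addr[k] - addr[i], sum(accesses[i:k]))

    def multi_bank(m, k, energy_so_far):
        # m-1 banks already cover [addr[0], addr[k)); first try
        # [addr[k], addr[n)) as the m-th (last) bank.
        total = energy_so_far + bank_cost(k, n)
        if total < best["energy"]:
            best["energy"] = total
            best["solution"] = partial + [addr[k], addr[n]]
        if m >= M:
            return
        for j in range(k + 1, n):                    # m-th bank = [addr[k], addr[j))
            extended = energy_so_far + bank_cost(k, j)
            if extended >= best["energy"]:
                continue                             # backtracking cut: energy grows
                                                     # monotonically with SPM size
            partial.append(addr[k])                  # push partial solution
            multi_bank(m + 1, j, extended)
            partial.pop()                            # resume previous solution

    for k in range(1, n):                            # first bank ends at addr[k]
        multi_bank(2, k, bank_cost(0, k))
    return best["solution"], best["energy"]
```

With a toy cost model growing with both bank size and access count, the returned pair plays the roles of crtBestSolution and crtMinEnergy.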
5.2 Lattice-based dynamic programming algorithm
The inputs are identical to those of the previous algorithm, except for the last one:
Input 3: An array \({\mathcal {E}} = [\!0\ \Delta E_{2}\ \Delta E_{3}\ \ldots \ \Delta E_{M}]\) whose elements ΔE_k (1<k≤M) are the energy overheads resulting from moving from a monolithic SPM to one with k banks (obviously, ΔE_1=0). These energy overheads were estimated with the power calculator from Lattice Diamond [31].

2D “cost” array C: each element C[i, j] (0≤i<j≤n) is initialized to the energy consumed by a monolithic SPM having the address space [addr_i, addr_j) and storing the linearly bounded lattices L_{i+1}, …, L_j; in particular, C[0, n] is, initially, the energy consumed by the whole monolithic SPM. At the end of the algorithm, each element C[i, j] will contain the energy consumption after the address space [addr_i, addr_j) has been optimally partitioned, under the constraint that bank boundaries are only addresses from the input array \({\mathcal {A}}\). The additional exploration constraint that no disjoint lattice assigned to the SPM can cross a bank boundary ensures the effectiveness of the approach.

2D array m: each element m[i, j] (0≤i<j≤n) is the number of banks in the address space [addr_i, addr_j); all elements are initialized to 1.

2D array s : used for constructing an optimal partitioning solution.
Note that, if any of the component address spaces is not partitioned, say m[i, k]=1, then ΔE_{m[i,k]} = ΔE_1 = 0. All these energy overheads are elements of the array \({\mathcal {E}}\) (Input 3).
The conditional instruction 13 eliminates the solutions exceeding M banks. Arbitrarily fine partitioning is prevented since an excessively large number of small banks is area inefficient, imposing a severe wiring overhead, which also tends to increase communication power and decrease performance.
The time complexity of the algorithm is Θ(n^3), due to the three for loops (instructions 8, 9, and 12). The main parameter n is the number of disjoint linearly bounded lattices assigned to the SPM; these n lattices represent only a subset of the disjoint LBLs resulting from the partitioning of the multidimensional signals in the behavioral specification (part of the LBLs being stored off-chip in a DRAM).
The space complexity is Θ(n^2+M). The latter term is entailed by the input array \({\mathcal {E}}\). Note that the maximum number of banks M typically has a small value, so it is negligible in comparison to the former term n^2.
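As an illustration of the recurrence behind this algorithm (our reading of the description above, not the authors' code; the overhead bookkeeping with the array \({\mathcal {E}}\) is an assumption, and mono_energy stands in for the CACTI-based initialization of C), a Python rendering could be:

```python
# Sketch of the lattice-based dynamic programming banking (our reading of the
# description above; the overhead bookkeeping with delta_E is an assumption).
#   C[i][j]    : best energy found for the address space [addr[i], addr[j))
#   m_[i][j]   : number of banks in that best partition
#   s[i][j]    : split point of that partition (s[i][j] == i means no split)
#   delta_E[k] : overhead of moving from a monolithic SPM to k banks
#                (delta_E[1] == 0; index 0 is unused)
#   mono_energy(i, j) : stand-in for the cost of a monolithic bank storing
#                       the lattices L_{i+1}, ..., L_j

def dp_banking(n, M, delta_E, mono_energy):
    C = [[0.0] * (n + 1) for _ in range(n + 1)]
    m_ = [[1] * (n + 1) for _ in range(n + 1)]
    s = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(i + 1, n + 1):
            C[i][j] = mono_energy(i, j)      # initialization: monolithic cost
            s[i][j] = i
    for span in range(2, n + 1):             # increasing address-space spans
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # candidate bank boundary addr[k]
                banks = m_[i][k] + m_[k][j]
                if banks > M:                # discard solutions with > M banks
                    continue
                # each sub-cost already carries its own banking overhead;
                # replace the two sub-overheads by the merged one
                cost = (C[i][k] + C[k][j]
                        + delta_E[banks] - delta_E[m_[i][k]] - delta_E[m_[k][j]])
                if cost < C[i][j]:
                    C[i][j], m_[i][j], s[i][j] = cost, banks, k
    return C, m_, s
```

The three nested loops mirror instructions 8, 9, and 12 and give the Θ(n^3) bound; the final partition can be read back from the array s.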
The banking solution can be determined by calling the recursive function PrintOptimalPartition(s, 0, n, \({\mathcal {A}}\)):
void PrintOptimalPartition (s, i, j, \({\mathcal {A}}\)) {
    if (s[i, j] == i)
        print addr[i] ;
    else {
        PrintOptimalPartition (s, i, s[i, j], \({\mathcal {A}}\)) ;
        PrintOptimalPartition (s, s[i, j], j, \({\mathcal {A}}\)) ;
    }
}
6 Experimental results
Table 1 Experimental results for a motion detection algorithm

| Parameters | Scalars (array elements) | Memory accesses | Minimum data storage [22] [bytes] | Data storage after mapping [bytes] | SPM size [bytes] | SPM banks | SPM energy [μJ] | DRAM size [bytes] | DRAM energy [μJ] | Total energy [μJ] | CPU [s] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M=N=32, m=n=8 | 185,239 | 361,250 | 3,364 | 3,364 | 3,362 | 2 | 2.24 | 0 | 0 | 2.24 | 1.8 |
| | | | | | 2,306 | 2 | 2.14 | 1,056 | 15.02 | 17.16 | |
| | | | | | 1,650 | 2 | 1.86 | 1,712 | 18.09 | 19.95 | |
| | | | | | 1,250 | 1 | 1.64 | 2,112 | 20.62 | 22.26 | |
| | | | | | 0 | – | 0 | 3,362 | 56.03 | 56.03 | |
| M=N=64, m=n=16 | 2,632,615 | 5,229,378 | 13,124 | 13,124 | 13,122 | 2 | 97.47 | 0 | 0 | 97.47 | 24.7 |
| | | | | | 6,370 | 3 | 37.23 | 6,752 | 263.59 | 300.82 | |
| | | | | | 4,802 | 2 | 32.52 | 8,320 | 302.29 | 334.81 | |
| | | | | | 0 | – | 0 | 13,122 | 854.38 | 854.38 | |
Table 2 Experimental results for energy-aware assignment of signals to the on- and off-chip memory layers

| Application | Scalars (array elem.) | Memory accesses | Data memory after mapping [bytes] | SPM size [bytes] | DRAM size [bytes] | Energy [μJ] | Energy savings | CPU [s] |
|---|---|---|---|---|---|---|---|---|
| Motion detection | 2,632,615 | 5,229,378 | 13,124 | 4,802 | 8,320 | 334.81 | 28.42 % | 24.7 |
| Motion estimation | 265,633 | 1,053,089 | 4,513 | 256 | 4,257 | 138.31 | 22.12 % | 2.8 |
| Gaussian blur filter | 53,615 | 77,619 | 14,803 | 5,003 | 9,800 | 6.13 | 26.66 % | 3.6 |
| Durbin algorithm | 252,499 | 1,005,993 | 1,998 | 500 | 1,498 | 123.62 | 20.40 % | 39.1 |
| SVD updating algorithm | 3,045,447 | 29,500,000 | 34,950 | 4,096 | 30,854 | 1601.52 | 24.81 % | 47.5 |
| Voice coding kernel | 33,835 | 47,416 | 14,634 | 2,032 | 12,602 | 2.56 | 18.76 % | 4.8 |
Table 3 Experimental results for energy-aware SPM banking

| Application | Address space | CPU, full expl. [5], M=4 [s] | CPU, Alg. 3, M=8 [s] | CPU, dyn. prog. [6] [s] | CPU, Alg. 4, M=8 [s] | Savings vs. [5] (M=4) | Savings vs. [6] | Savings vs. monolithic |
|---|---|---|---|---|---|---|---|---|
| Motion detection | 6,370 | 3,163 | 2.0 | 1,592 | 1.4 | 7.2 % | 2.9 % | 41.2 % |
| Motion estimation | 1,024 | 736 | 0.8 | 1.8 | 0.6 | 5.1 % | 2.3 % | 36.5 % |
| Durbin’s algorithm | 500 | 247 | 2.8 | 3.2 | 1.7 | 5.0 % | 3.9 % | 26.5 % |
| SVD updating alg. | 4,096 | 2,405 | 10.7 | 3,048 | 8.2 | 6.4 % | 3.6 % | 28.4 % |
| Voice coding kernel | 2,032 | 1,297 | 1.5 | 336 | 1.1 | 7.8 % | 5.4 % | 31.4 % |
Table 1 shows several experiments taking as input application a motion detection algorithm, used in the transmission of real-time video signals over data networks. It displays the energy consumption in the memory subsystem for different data assignments to the memory layers. Column 1 shows the values of the parameters of the motion detection algorithm; columns 2 and 3 display the numbers of scalar signals (array elements) and the total numbers of read/write accesses. Column 4 displays the storage requirements of the application, computed with the algorithm from [22], which is embedded in our framework. For the motion detection, our mapping algorithm (Section 4) finds optimal mapping solutions in terms of storage (column 5). Actually, two multidimensional signals from the application code will be stored in two registers: their footprint is only 1 byte each, since our tool correctly detected that their elements have disjoint lifetimes. Columns 6–10 then present different scenarios for data assignment between the on-chip SPM and the off-chip DRAM, together with the energy consumption (both static and dynamic) in these memories.^{6} Column 11 displays the total energy consumption in the memory subsystem; e.g., for the first set of parameters, the total energy increases from 2.24 μJ (when all the data is stored in the SPM) to 56.03 μJ (when all the data is stored off-chip). The computation times (column 12) are very similar for each data assignment, so only ballpark values are given.
The benchmarks used in the next tables are algebraic kernels (Durbin’s algorithm for solving Toeplitz systems; a singular value decomposition (SVD) updating algorithm [34] used in spatial division multiplex access (SDMA) modulation in mobile communication, in beamforming, and in Kalman filtering) and a few multimedia applications: the kernel of an MPEG-4 motion estimation algorithm for moving objects; a 2D Gaussian blur filter algorithm from a medical image processing application, which extracts contours from tomograms in order to detect brain tumors; and the kernel of a voice coding application, an essential component of a mobile radio terminal.
Table 2 displays in columns 2–3 information on the behavioral specification of the given application (column 1): the amounts of scalar signals (array elements) and the numbers of memory accesses. Column 4 shows the amount of data storage computed by the mapping algorithm. Column 7 then displays the (static and dynamic) energy consumption in the memory subsystem when the sizes in bytes of the SPM and DRAM are the ones shown in columns 5–6. For a better evaluation of our energy-aware data assignment model, we implemented another signal assignment strategy, similar to the one used in [23], where the steering mechanism is based on the intensely accessed cuts within the array space. The energy savings (column 8) were typically between 18 % and 28 % relative to this model. The CPU times for executing the entire memory management flow are shown in column 9. The tests have been done for a 32 nm technology, assuming a clock frequency of 400 MHz.
Table 3 shows the savings of energy consumption after SPM banking (32 nm technology) for various benchmarks. Column 2 displays the number of addresses in the on-chip memory. Column 3 reports the computation times for a full exploration with backtracking, implemented as the one presented in [4, 5], targeting energy reduction but using CACTI 6.5 [28] for power estimation (the maximum number of banks was set to M=4, since for larger values of M the times were unknown, as the exploration had to be stopped after several hours). Our own energy results for M=4 for all the benchmark tests were no more than 0.4 % higher than the optimal ones, but they were all obtained in only a fraction of a second, in contrast to the significant running times in column 3. Column 4 reports the computation times in seconds for our recursive banking algorithm, which explored the search space for up to M=8 banks, a value for which [5] could not complete when the wordlength is 1 byte.
Column 5 shows the computation times obtained by running an implementation of the dynamic programming approach of Angiolini et al. [6]. The main input of this algorithm is the graph of memory accesses during the execution of the application code. The main data structure is an array whose numbers of rows and columns equal the size (in words) of the graph of memory accesses and, respectively, the size of the SPM. The array elements are profit values targeting energy (or, alternatively, performance) optimization. The silicon area is indirectly taken into account by heuristically increasing the indexes of the profits computed during the dynamic programming by amounts depending on ratios of SPM areas. The time complexity of the algorithm is polynomial, proportional to the square of the SPM size times the size of the graph. The practical running times can be significant, though, for benchmarks with a large memory address space and/or a large SPM (while typically faster than the full exploration with backtracking [5], we also encountered examples where this technique was slower, due in part to the fact that the number of banks is unconstrained).
Column 6 reports the computation times for our banking algorithm using dynamic programming: this technique proves to be faster than Algorithm 3, which was expected due to the polynomial complexity of Algorithm 4. The additional exploration constraint that no disjoint lattice assigned to the SPM can cross a bank boundary ensures the effectiveness of our basic approach when M≥4: this constraint significantly reduces the search space, typically yielding near-optimal results.
The data structures of our dynamic programming approach (see Section 5.2) are significantly smaller in size, and the computation of the energy costs (the elements of array C) allows portability from the back-end tool CACTI to other memory models. In contrast, the dynamic programming approach from [6] uses a heuristic index increase (based on ratios of SPM areas) in the array of energy profits, which is dependent on the memory model employed.^{7} Our dynamic programming technique can optimize die area instead of energy consumption (or a weighted combination of the two) by redesigning the function SPM_energy from Algorithm 4.
Not only were the computation times of our tool far better, but our tool also found partitions of more than four banks that were superior in terms of energy consumption to the four-bank solutions found by the previous technique [5]: column 7 reports the energy savings versus the full exploration for M=4. Column 8 shows the energy savings of our algorithm versus the dynamic programming approach similar to [6]. Note that this dynamic programming technique yielded better results than [5], since it found energetically better solutions with more than four banks. On the other hand, our algorithm found even better solutions, since it could exploit the idleness intervals of the memory banks (which [6] does not do). Column 9 displays the energy savings obtained by our tool with respect to the case of a monolithic SPM.
We also tested the algorithms from this EDA framework on a larger code of about 900 lines (mentioned also in [22]), containing 113 three-level-deep loop nests and 906 array references, many having complex indexes. Algorithm 1 ran in about 2.4 min, building the DAG of inclusions (like the one illustrated in Fig. 6) with 3159 nodes (LBLs) and preparing the polyhedral data structures required by the memory management tasks. Algorithm 2 was fast, running in less than 10 s. (Note that there was a preliminary step, not taken into account here, when our CACTI interface obtained data on power and access times by running CACTI 6.5 for a range of DRAM and SPM sizes; afterwards, these data can be used in other benchmarks as well.) The signal-to-memory mapping step was more computationally expensive (almost 4 min), since many LBLs from the specification code were produced and consumed in the same loop nests, and the number of canonical linearizations of 3D arrays is 48. Algorithm 3 (the recursive algorithm with backtracking) ran in 3.7 min for a maximum number of SPM banks M=5, while Algorithm 4 was even faster: 2.3 min.
7 Conclusions
This paper has presented an EDA framework for the high-level design of hierarchical memory architectures, targeting embedded data-intensive signal processing applications. The methodology is focused on the reduction of the energy consumption in the memory subsystem. The data assignment to the storage layers, the signal-to-memory mapping, and the on-chip memory banking are all efficiently addressed within a common polyhedral framework. The steering assignment mechanism is based on the identification of the intensely accessed regions within the array space of the multidimensional signals. The added flexibility of this assignment model led to superior energy savings in comparison to earlier approaches.
8 Endnotes
^{1} That is, the execution ordering is induced by the loop structure and, hence, it is fixed. The research on code transformation is orthogonal to our methodology, but it could be used as a preliminary step.
^{2} Solving a linear Diophantine system was proven to be of polynomial complexity, the various methods being typically based on bringing the system matrix to Hermite Normal Form [26].
^{3} CACTI 6.5 is an analytical tool that takes a set of SPM, cache, or DRAM parameters as inputs and calculates memory data – like access time, static power, dynamic energy spent per access, and area [28].
^{4} The computation method employed by De Greef et al. consists of a sequence of integer linear programming (ILP) optimizations for each canonical linearization [10].
^{5} This is based on the computation of the lexicographically minimum and maximum iterator vectors of the lattice elements in normalized loops, an operation described in [22].
^{6} Memory generators do not allow all possible values for memory sizes or for bank boundaries: for instance, a memory generator may yield storage blocks of only a multiple of 16 bytes. Although our framework can take such constraints into account, these tests aim to illustrate the algorithms, so no such constraint is imposed.
^{7} This is a key reason why a comparison with the results on the benchmarks in [6] is difficult to achieve without insider knowledge.
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. F Catthoor, K Danckaert, C Kulkarni, E Brockmeyer, PG Kjeldsberg, TV Achteren, T Omnes, Data Access and Storage Management for Embedded Programmable Processors (Springer, 2010).
2. PR Panda, N Dutt, A Nicolau, F Catthoor, A Vandecapelle, E Brockmeyer, C Kulkarni, E De Greef, Data memory organization and optimizations in application-specific systems. IEEE Design & Test of Computers, 56–68 (2001).
3. M Verma, P Marwedel, Advanced Memory Optimization Techniques for Low-Power Embedded Processors (Springer, 2007).
4. A Macii, L Benini, M Poncino, Memory Design Techniques for Low Energy Embedded Systems (Kluwer Academic Publ., Boston, 2002).
5. L Benini, L Macchiarulo, A Macii, M Poncino, Layout-driven memory synthesis for embedded systems-on-chip. IEEE Trans. VLSI Syst. 10(2), 96–105 (2002).
6. F Angiolini, L Benini, A Caprara, An efficient profile-based algorithm for scratchpad memory partitioning. IEEE Trans. Computer-Aided Design IC’s Syst. 24(11), 1660–1676 (2005).
7. PR Panda, N Dutt, A Nicolau, On-chip vs. off-chip memory: the data partitioning problem in embedded processor-based systems. ACM Trans. Design Automation Electronic Syst. 5(3), 682–704 (2000).
8. M Kandemir, G Chen, F Li, in Proc. Asia-South Pacific Design Aut. Conf. Maximizing data reuse for minimizing space requirements and execution cycles (Yokohama, Japan, 2006), pp. 808–813.
9. E Brockmeyer, M Miranda, H Corporaal, F Catthoor, in Proc. 6th ACM/IEEE Design and Test in Europe Conf. Layer assignment techniques for low energy in multi-layered memory organisations (Munich, Germany, 2003), pp. 1070–1075.
10. E De Greef, F Catthoor, H De Man, Memory size reduction through storage order optimization for embedded parallel multimedia applications, in Parallel Computing, 23, special issue on Parallel Processing and Multimedia, ed. by A Krikelis (Elsevier, 1997), pp. 1811–1837.
11. R Tronçon, M Bruynooghe, G Janssens, F Catthoor, Storage size reduction by in-place mapping of arrays. Verification, Model Checking and Abstract Interpretation, 167–181 (2002).
12. S Verdoolaege, JC Juega, A Cohen, JI Gomez, C Tenllado, F Catthoor, Polyhedral parallel code generation for CUDA. ACM Trans. Arch. Code Optimization 9(4), 54–77 (2013).
13. A Darte, R Schreiber, G Villard, Lattice-based memory allocation. IEEE Trans. Comput. 54, 1242–1257 (2005).
14. W Shiue, C Chakrabarti, in Proc. 36th ACM/IEEE Design Aut. Conf. Memory exploration for low power embedded systems (New Orleans, 1999), pp. 140–145.
15. O Golubeva, M Loghi, M Poncino, E Macii, in Proc. ACM/IEEE Design Automation and Test in Europe. Architectural leakage-aware management of partitioned scratchpad memories (Nice, France, 2007), pp. 1665–1670.
16. M Loghi, O Golubeva, E Macii, M Poncino, Architectural leakage power minimization of scratchpad memories by application-driven subbanking. IEEE Trans. Comput. 59(7), 891–904 (2010).
17. W Kelly, V Maslov, W Pugh, E Rosser, T Shpeisman, D Wonnacott, The Omega Library interface guide. Technical Report CS-TR-3445, Univ. of Maryland, College Park (1995).
18. R Wilson, R French, C Wilson, S Amarasinghe, J Anderson, S Tjiang, SW Liao, CW Tseng, M Hall, M Lam, J Hennessy, SUIF: an infrastructure for research on parallelizing and optimizing compilers. ACM SIGPLAN Notices 29(12), 31–37 (1994).
19. J Ramanujam, J Hong, M Kandemir, A Narayan, A Agarwal, Estimating and reducing the memory requirements of signal processing codes for embedded systems. IEEE Trans. Signal Process. 54(1), 286–294 (2006).
20. V De La Luz, I Kadayif, M Kandemir, U Sezer, Access pattern restructuring for memory energy. IEEE Trans. Parallel Distributed Syst. 15(4) (2004).
21. R Allen, K Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-Based Approach (Morgan Kaufmann Publ., 2001).
22. F Balasa, H Zhu, II Luican, Computation of storage requirements for multidimensional signal processing applications. IEEE Trans. VLSI Syst. 14(4), 447–460 (2007).
23. Q Hu, A Vandecapelle, M Palkovic, PG Kjeldsberg, E Brockmeyer, F Catthoor, in Proc. Asia-South Pacific Design Autom. Conf. Hierarchical memory size estimation for loop fusion and loop shifting in data-dominated applications (Yokohama, Japan, 2006), pp. 606–611.
24. G Talavera, M Jayapala, J Carrabina, F Catthoor, Address generation optimization for embedded high-performance processors: a survey. J. Signal Process. Syst. 53(3), 271–284 (2008).
25. L Thiele, Compiler techniques for massive parallel architectures, in State-of-the-art in Computer Science, ed. by P Dewilde (Kluwer Acad. Publ., 1992).
26. A Schrijver, Theory of Linear and Integer Programming (John Wiley, New York, 1986).
27. PH Clauss, in Proc. European Conf. on Parallel Processing. Handling memory cache policy with integer point counting (Passau, Germany, 1997), pp. 285–293.
28. CACTI 6.5. [Online]. Available: http://www.cs.utah.edu/~rajeev/cacti6/.
29. S Verdoolaege, K Beyls, M Bruynooghe, F Catthoor, Experiences with enumeration of integer projections of parametric polytopes, in Compiler Construction: 14th Int. Conf., LNCS 3443, ed. by R Bodik (Springer, 2005), pp. 91–105.
30. A Helal, F Balasa, in Proc. 20th IEEE Int. Conf. on Control Systems and Computer Science. Multi-threaded signal-to-memory mapping algorithm for embedded multidimensional signal processing (Bucharest, Romania, 2015), pp. 255–260.
31. Lattice Diamond. [Online]. Available: www.latticesemi.com.
32. K Flautner, N Kim, S Martin, D Blaauw, T Mudge, in Proc. Symp. Computer Architecture. Drowsy caches: simple techniques for reducing leakage power (2002), pp. 148–157.
33. M De Berg, O Cheong, M van Kreveld, M Overmars, Computational Geometry: Algorithms and Applications (Springer, 2010).
34. M Moonen, P Van Dooren, J Vandewalle, An SVD updating algorithm for subspace tracking. SIAM J. Matrix Anal. Appl. 13(4), 1015–1038 (1992).