After being assigned to the off- and on-chip memory layers, the linearly bounded lattices are mapped to the external DRAM and SPM; so, the distribution of the memory accesses to the SPM address space is known. Let us assume that the range of contiguous addresses mapped to the on-chip SPM is {0, 1, …, N−1}, that memory is word-addressable and the word width is known (being imposed by the chosen core processor). The dynamic energy \(E_{1}^{dyn}(0,N)\) (where the arguments are the start address and the number of words, the subscript being the number of banks) consumed by a monolithic SPM is [5]
$$\begin{aligned} E_{1}^{dyn}(0,N)=\!E_{R}(N)\cdot\! \sum_{i=0}^{N-1}read[\!i]\ +\ E_{W}(N)\cdot \sum_{i=0}^{N-1}write[\!i]\!, \end{aligned} $$
where E
R
(N) and E
W
(N) are the energies consumed per read, respectively write, access to an SPM of N words; they are technology-dependent metrics. In addition, r
e
a
d[ i] and w
r
i
t
e[ i] represent the number of accesses to word i and, consequently, the sums represent the total numbers of read/write accesses to the on-chip memory locations 0, 1, …, N−1.
If the address space of the on-chip SPM is arbitrarily partitioned in two ranges {0, 1, …, k−1} and {k, k+1, …, N−1}, then the dynamic energy consumed in the two-bank SPM becomes:
$$E_{2}^{dyn}(0,k,N)=E_{1}^{dyn}(0,k)+E_{1}^{dyn}(k,N-k) $$
The first two arguments of \(E_{2}^{dyn}\) are the start addresses in words of the two banks, the third being the total size.
The static energy consumed in the two-bank SPM, having the address space partitioned as above, is the sum of the static energies in each bank: \(E_{2}^{st}(0,k,N)=E_{1}^{st}(0,k)+E_{1}^{st}(k,N-k)\). Neither term depends on the number of memory accesses.
The partitioning is energetically beneficial if \(E_{2}^{dyn}(0,k,N)+E_{2}^{st}(0,k,N)+\Delta E_{12}\ <\ E_{1}^{dyn}(0,N)+E_{1}^{st}(0,N)\), where Δ
E
12 is the energy overhead required by the extra logic (usually, a decoder) and interconnections necessary to move from the monolithic SPM to a two-bank architecture. Figure 12 shows the more complex architecture of a multi-bank versus the monolithic architecture: the additional components and interconnects—the address and data buses, the decoder, the control signals—may introduce a non-negligible overhead on power consumption that must be compensated by the savings entailed by bank partitioning. These savings are caused by the average power decrease in accessing the memory hierarchy, because a large fraction of accesses is typically concentrated on a smaller, more energy-efficient bank. In addition, the memory banks that stay idle long enough can be disabled through their chip-select (CS) pins. Equivalently, the partitioning is energetically beneficial if the energy benefit of the two-bank solution
$$\left(1\ -\ \frac{E_{2}^{dyn}(0,k,N)+E_{2}^{st}(0,k,N)+\Delta E_{12}} {E_{1}^{dyn}(0,N)+E_{1}^{st}(0,N)} \right) \times 100\ [\%] $$
versus a monolithic SPM is positive.
The solution space of two-way memory banking can be exhaustively explored (and, hence, optimally solved in energy point of view) by iteratively moving the upper bound k of the first bank from 1 to N−1, and finding the global minimum: \(\min _{k}{\{E_{2}^{dyn}(0,k,N)+E_{2}^{st}(0,k,N)\}}+\Delta E_{12}\).
A similar cost metric can be used to explore multi-way banking solutions: any possible partition into M (≥2) banks is defined by a set of M-1 addresses identifying the memory bank boundaries. Based on this idea, Benini et al. implemented a recursive algorithm [5] where the solution space is exhaustively explored (their main input is the graph of the distribution of memory accesses to the SPM address space, rather than the behavioral specification of the application). This search for an energetically-optimal solution proves to be computationally expensive (see Section 6), even infeasible, for larger values of M—the maximum number of banks—and/or larger values of the SPM size. Angiolini et al. carried out more efficiently a similar exploration using dynamic programming [6]. Although the time complexity is polynomial, our experiments found that the running times of their method exhibit a fast increase with the sizes of the SPM and the execution trace (the computation time was over 8 h for our illustrative example in Fig. 4, and an SPM size of 8 Kbytes, when the memory word was 1 byte. For SPM sizes smaller than 2 Kbytes, the technique can be effective, though).
The banking algorithms we propose are consistent with our model of partitioning the array space of signals into disjoint lattices (see Section 3). For M<4, these algorithms are basically identical to the exploration algorithm presented in [5] since this approach yields optimal solutions. For M≥4, as the running times may be extremely large, we introduce a constraint that significantly reduces the exploration space: no SPM-assigned lattice can cross a bank boundary. This constraint ensures the effectiveness of our approach in point of view of speed and near-optimality of the results—as Example 4 will show.
5.1 Lattice-based recursive algorithm
In addition to M, the maximum number of SPM banks, the inputs of the SPM partitioning algorithm are:
Input 1: An array \({\mathcal {A}}=[{addr}_{0},\ {addr}_{1},\ \ldots \,\ {addr}_{n}]\) of ordered addresses such that a linearly bounded lattice L
k
, k=1,…,n, assigned to the on-chip memory layer be mapped at the SPM successive addresses {a
d
d
r
k−1,…, a
d
d
r
k
−1}.
Input 2: An array \({\mathcal {R}W}=[{rw}_{1},\ \ldots,\ {rw}_{n}]\) which elements represent the numbers of read/write accesses for each lattice mapped onto the SPM (notice that the numbers of read/write accesses for each lattice mapped onto the SPM are already known from Section 3).
Input 3: An array \({\mathcal {E}}=[\Delta E_{12},\ \Delta E_{23},\ \ldots \, \ \Delta E_{M-1,M}]\), which elements Δ
E
k,k+1 are the energy overheads resulting from moving from an on-chip SPM with k banks to one with k+1 banks. The decoding circuitry was synthesized using the ECP family of FPGA’s from lattice semiconductor [31] and, for the energy overheads, we used the power calculator from Lattice Diamond [31].
Output: The energetically-optimal SPM partitioning, i.e., an array of SPM addresses delimiting the banks, and the minimum value of the total (static and dynamic) energy consumption for this optimal SPM banking solution.
The algorithm starts from the monolithic architecture and searches for the energetically-optimal partitioning of the SPM in no more than M memory banks, such that the borderlines between banks are addresses in the array \({\mathcal {A}}\) (hence, ensuring that any lattice of signals is entirely stored in one bank). A variable crtBestSolution records the set of addresses in \(\mathcal {A}\) corresponding to the most energetically-efficient partition reached in any moment of the exploration; initially, the SPM being monolithic, this set is {a
d
d
r
0, a
d
d
r
n
}. A variable crtMinEnergy registers the total energy consumption of the best SPM banking solution encountered during the exploration. A function SPM_energy(bank_size, number_accesses) uses CACTI 6.5 [28] and the number of read/write accesses in order to compute the total energy (both static and dynamic) consumed in a bank of the specified size. A recursive function Multi_Bank, whose first formal parameter m (initially equal to 2) is the current number of banks, searches for the optimal solution such that the first bank ends at a
d
d
r
k
. This function is successively called for k=1, 2, …, n−1. EnergyConsumed registers the amount of energy consumed from the start of the SPM till the borderline a
d
d
r
k
. If its value exceeds the best energy already recorded (crtMinEnergy), there is no need to continue the exploration since all the next solutions will be energetically-worse—due to the monotonic increase of the energy consumption with the SPM size.
Outputs: crtBestSolution – an ordered set of SPM addresses from \({\mathcal {A}}\) delimiting the banks, and the corresponding energy consumption crtMinEnergy.
The recursive function Multi_Bank searches for the best banking solution starting from a
d
d
r
k
till the end of the SPM at a
d
d
r
n
. From the beginning of the SPM (a
d
d
r
0) till the address a
d
d
r
k
there are already m−1 banks. At the beginning, the function considers [ a
d
d
r
k
, a
d
d
r
n
] as the mth bank of the SPM: if this banking configuration is energetically better than all previous solutions, it is duly recorded as the best solution reached during the exploration. If the maximum number of banks is not reached yet (m<M), then the function explores solutions with m+1 banks or more, considering the mth bank to be [a
d
d
r
k
, a
d
d
r
j
], for j=k+1, k+2, …, n−1.
A solution stack is used and typical stack functions (push, pop, top) to record and resume the partial banking solutions. For instance, the push instruction in the body of the Multi_Bank function takes the set of memory addresses on the top of the stack, adds the new element a
d
d
r
k
to it, and the new set is pushed back on the stack.
Since the energy cost is monotonically increasing with the SPM size, a backtracking mechanism is incorporated before the recursive call to prevent the search towards more energetically-expensive partitions. The output of the algorithm is an array of SPM addresses delimiting the banks, and the corresponding energy consumption.
In addition, for each LBL in the decomposition of the array space, we compute the time intervals (in clock cycles) when the lattice is not accessed. This idleness analysis cannot be done directly in terms of time: first, it is done in terms of loop iterators. For instance, we must determine the iterator vectors in the loop nests when a disjoint lattice is accessed for the first time and for the last time.5 Only afterwards, we compute the clock cycles during the code execution corresponding to those iterator vectors. When the recursive function Multi_Bank investigates the case when the m-th bank is between A
d
d
r[ k] and A
d
d
r[ n] (the end of the SPM), the idleness intervals of the lattices L
k+1, …,L
n
assigned to this m-th bank are intersected in order to determine whether there are idleness intervals at the bank level. If this is the case, the bank can be switched to the sleep state during the idleness intervals that are large enough. (A time overhead of one clock cycle for the transition from the sleep to the active state is also applied, in accordance with simulated data on caches reported in [32]). In order to overcome the energy overhead entailed by the transition of a memory bank from the active state into the sleep state and back to the active state, the bank must remain in the sleep state at least a minimum number of clock cycles (otherwise, the economy of static energy is lesser than the energy overhead of the transitions). This idleness threshold in cycles can be estimated; typical values are in the order of hundreds of cycles [16]. So, if the idleness of a bank (resulted from the intersection of the idleness intervals of the lattices assigned to the bank) exceeds the idleness threshold, then the energy cost of the bank is computed taking into account the switches to the sleep state and back. The idleness intervals of each lattice are organized into an interval tree [33] as the depth of this data structure is O(l
o
g
n) for n intervals, and typical interval operations have logarithmic complexity.
A high-level pseudo-code of the recursive function Multi_Bank is given below:
Example 4: Let us consider again the illustrative example from Fig. 4, where the 64 central lattices of the array A (numbered 96–159 in Fig. 8) were stored in an SPM of 8 Kbyte, the external DRAM being of 56 Kbytes. As shown in Section 3, the energy consumption of this monolithic SPM is 3,063.6 μ
J, assuming a 32 nm technology. Running the banking algorithm from [5], where the maximum number of banks M was set to 4, the 4-bank optimal banking solution (of 1,602.23 μ
J) was found in 4433 s, after the exploration of 58.73 billion SPM partitions. (Setting M at a higher value than 4 proved to be computationally infeasible for [5]). In contrast, Algorithm 3 (for M=8) found a 6-bank solution of a lower energy (1564.95 μ
J) in only 30.38 s, exploring less than 1 % SPM partitions (363.6 million versus 58.73 billion). Figure 13 displays the graphs of the best energies of banking solutions (the values of crtMinEnergy from Algorithm 3) found by the two techniques after analyzing the first ten thousand banking configurations. The figure shows that our algorithm finds faster the energetically-better solutions. To be sure, Algorithm 3 finds only sub-optimal solutions for values of M larger than 3; but these solutions are near-optimal and they are found very fast. For instance, setting M=4, Algorithm 3 found a 4-bank solution of 1604.01 μ
J, slightly worse than the optimal value 1,602.23 μ
J; on the other hand, the run time was only 0.012 s, that is, several orders of magnitude faster than 4433 s!
5.2 Lattice-based dynamic programming algorithm
The inputs are identical as in the previous algorithm, except the last one:
Input 3: An array \({\mathcal {E}} = [\!0\ \Delta E_{2}\ \Delta E_{3}\ \ldots \ \Delta E_{M}]\) which elements Δ
E
k
(1<k≤M) are energy overheads resulting from moving from a monolithic SPM to one with k banks (obviously, Δ
E
1=0). These energy overheads were estimated with the power calculator from Lattice Diamond [31].
The main data structures used by the algorithm are:
-
2-D “cost” array C : each element C[i, j] (0≤i<j≤n) is initialized to the energy consumed by a monolithic SPM having the address space [a
d
d
r
i
, a
d
d
r
j
) and storing the linearly bounded lattices L
i+1, …,L
j
; in particular, C[0, n] is, initially, the energy consumed by the whole monolithic SPM. At the end of the algorithm, each element C[i, j] will contain the energy consumption after the address space [a
d
d
r
i
, a
d
d
r
j
) was optimally partitioned—under the constraint that bank boundaries are only addresses from the input array \({\mathcal {A}}\). The additional exploration constraint—that no disjoint lattice assigned to the SPM can cross a bank boundary— ensures the effectiveness of the approach.
-
2-D array m : each element m[i, j] (0≤i<j≤n) is the number of banks in the address space [a
d
d
r
i
, a
d
d
r
j
); their initial value is 1.
-
2-D array s : used for constructing an optimal partitioning solution.
The first loop nest (instructions 4–7) initializes the three arrays: C – see Fig. 14
a, m, and s. The function S
P
M
energy
uses information provided by CACTI 6.5 [28]—the dynamic energy per access and the static power—to compute the energy consumption of a monolithic SPM of (a
d
d
r
j
−a
d
d
r
i
) bytes, which is accessed \(\sum _{k=i+1}^{j} {rw}_{k}\) times. A 32-nm technology is assumed by default. For the computation of the static energy consumption, the number of clock cycles for the execution of the given application is obtained by simulation; a frequency of 400 MHz is used by default (but this value can be modified by the user).
Afterwards, the next loop nest (starting at instruction 8) computes with a bottom-up approach (see Fig. 14
b) the energetically-optimal banking in the address space [a
d
d
r
i
, a
d
d
r
j
), where j−i=L increases gradually from L=2 to L=n. The last value of L corresponds to the entire address space of the SPM: [a
d
d
r
0, a
d
d
r
n
). For each pair (i, j), the optimal energy cost C[i, j] is computed (instructions 11–19) as the minimum between the energy consumed in the monolithic case, and
$${min}_{i<k<j}{\{\ C[\!i\,\ k]\ +\ C[\!k\,\ j]\ +\ EnergyOverhead\ \}} $$
that is, when a bank boundary is introduced at a
d
d
r
k
(in between a
d
d
r
i
and a
d
d
r
j
). Since the component address spaces [a
d
d
r
i
, a
d
d
r
k
) and [a
d
d
r
k
, a
d
d
r
j
) are already partitioned into m[i,k] and, respectively, m[ k,j] banks, the number of banks in the address space [a
d
d
r
i
, a
d
d
r
j
) would increase to m[ i,k] + m[ k,j], entailing an energy overhead of Δ
E
m[i,k]+m[k,j]– which must be added to the energy cost. At the same time, the energy overheads Δ
E
m[i,k] and Δ
E
m[k,j]—corresponding to the m[ i,k] and m[k,j] banks—must be subtracted from the cost (see instructions 14 and 16).
Note that, if any of the component address spaces is not partitioned, say m[ i,k]=1, then Δ
E
m[i,k] = Δ
E
1 = 0. All these energy overheads are elements of the array \({\mathcal {E}}\) (Input 3).
The conditional instruction 13 eliminates the solutions exceeding M number of banks. Arbitrarily fine partitioning is prevented since an excessively large number of small banks is area inefficient, imposing a severe wiring overhead, which also tends to increase communication power and decreaseperformance.
The time complexity of the algorithm is Θ(n
3) due to the three for loops (instructions 8, 9, and 12). The main parameter n is the number of disjoint linearly bounded lattices that are assigned to the SPM; these n lattices represent only a subset of the disjoint LBLs that result from the partitioning of the multidimensional signals from the behavioral specification (part of the LBLs being stored off-chip in a DRAM).
The space complexity is Θ(n
2+M). The latter term is entailed by the input matrix \({\mathcal {E}}\). Note that the maximum number of banks M has, typically, a small value, so it is negligible in comparison to the former term n
2.
The banking solution can be determined calling a recursive function PrintOptimalPartition(s, 0, n, \({\mathcal {A}}\)).
void PrintOptimalPartition (s, i, j, \({\mathcal {A}}\)) {if (s[i, j] == i) print addr[i] ;else PrintOptimalPartition (s, i, s[i, j], \({\mathcal {A}}\)) ; PrintOptimalPartition (s, s[i, j], j, \({\mathcal {A}}\)) ;}