
  • Research
  • Open Access

A novel power model for future heterogeneous 3D chip-multiprocessors in the dark silicon age

EURASIP Journal on Embedded Systems 2018, 2018:3

https://doi.org/10.1186/s13639-018-0086-1

  • Received: 23 August 2017
  • Accepted: 2 July 2018
  • Published:

Abstract

Dark silicon has recently emerged as a new problem in VLSI technology. Maximizing the performance of chip-multiprocessors (CMPs) under power and thermal constraints is very challenging in the dark silicon era. Next-generation analytical models for future CMPs that account for the power consumption of both core and uncore components, such as the cache hierarchy and on-chip interconnect, which consume a significant portion of the on-chip power, remain largely unexplored. In this article, we propose a detailed power model that is useful for future CMP power modeling. In the proposed architecture for future CMPs, we exploit emerging technologies such as non-volatile memories (NVMs) and 3D integration to combat dark silicon. Results extracted from simulations are compared with those obtained from the analytical model. The comparisons show that the proposed model accurately estimates the power consumption of CMPs running both multi-threaded and multi-programed workloads.

Keywords

  • Chip-multiprocessor (CMP)
  • Non-volatile memory (NVM)
  • 3D integration
  • Dark silicon
  • Uncore components
  • Heterogeneous caches

1 Introduction

In today’s chip-multiprocessor (CMP) architectures, power consumption is the primary constraint during system design. In the nanometer era, leakage power depletes the power budget and makes a substantial contribution to overall power consumption. A study by Kao and colleagues has shown that over 50% of the overall power dissipation in the 65-nm generation is due to leakage power [1], and this percentage is expected to increase in subsequent generations [2, 3].

Due to the breakdown of Dennard scaling, the fraction of transistors that can be simultaneously powered on within the peak power and temperature budgets drops exponentially with each process generation. This phenomenon has been termed the dark silicon era and is one of the newest challenges in multicore design [4]. Research shows that increasing leakage power consumption is a major driver of the unusable portion, or dark silicon, in future many-core CMPs [4]. Uncore components such as memory and the on-chip interconnect consume a large portion of chip power; moreover, uncore components, especially those in the cache hierarchy, are the dominant leakage consumers in multi/many-core CMPs. Therefore, power management of these components can be critical to maximizing design performance in the dark silicon era. Recent studies predict that more than 50% of the chip will be effectively dark: idle, dim, or under-clocked [5], and this percentage will grow as transistor dimensions scale down. It is therefore extremely important to provide next-generation architectural techniques, design tools, and analytical models for future many-core CMPs in the presence of dark silicon [6, 7]. Prior studies on dark silicon focus only on core design to address the problem. In this work, we show that uncore components such as the cache hierarchy and on-chip interconnect are significant contributors to the overall chip power budget in the nanoscale era and play important roles in the dark silicon age. Since the increase in CMOS device power density leads to the dark silicon phenomenon, emerging power-saving materials manufactured with nanotechnology might be useful for illuminating the dark area of future CMPs.

The long switching delay and high switching energy of such emerging low-power materials are the main drawbacks that prevent manufacturers from completely replacing traditional CMOS in future processor manufacturing [8]. Therefore, architecting heterogeneous CMPs that integrate cores and cache hierarchy made of different materials on the same die emerges as an attractive design option to alleviate the power constraint. In this work, we use emerging technologies, such as three-dimensional integrated circuits (3D ICs) [9, 10] and non-volatile memories (NVMs) [11–13], to exploit device heterogeneity and design dark silicon-aware multi/many-core systems. 3D die-stacking allows core and uncore components manufactured in different technologies to be integrated into a single package, reducing global wire lengths and improving performance. Among the several benefits that 3D integration offers over 2D technologies, mixed-technology stacking is especially attractive for placing NVM on top of CMOS logic, letting designers take full advantage of the benefits that NVM provides.

In this paper, we propose an accurate power model that formulates the power consumption of 3D CMPs with stacked cache layers. The model can be used for both homogeneous and heterogeneous cache layers. Unlike previous research on dark silicon, which considers only the portion of power consumption related to on-chip cores [4, 14–16], the proposed model treats the power of uncore components, such as the cache hierarchy and on-chip interconnect, as important contributors to the total CMP power consumption.

In future many-core CMPs, at 22 nm and beyond, emerging leakage-aware technologies such as FinFETs, FDSOI structures, and non-volatile memories are candidate materials for architecting heterogeneous components. The power model proposed in this work can be applied to different technologies by substituting the power and latency parameters of the new technology.

McPAT [17] (an integrated power, area, and timing modeling framework for multithreaded, multicore, and many-core architectures) cannot estimate the power consumption of 3D CMPs. The maximum number of cores that McPAT supports for power modeling in a many-core processor when attached to GEM5 [18] is 128, owing to the limitations of existing 2D integration. NVmain [19] (a user-friendly memory simulator for modeling (non-)volatile memory systems) estimates only the power consumption of memory components; it does not consider the power consumption of core and uncore components simultaneously. To the best of our knowledge, the proposed model is the first work on power modeling of network-on-chip (NoC)-based CMPs with a stacked cache hierarchy as future CMPs.

In this paper, we make the following novel contributions:
  1. We propose an accurate power model for future CMPs with stacked cache layers that captures the power consumption of core and uncore components in parallel.
  2. The proposed power model for 3D CMPs supports power analysis for both multi-programed and multi-threaded workloads.
  3. In the proposed power model, we target, for the first time, CMPs with a large number of cores (e.g., more than eight; many-core CMPs) built on scalable networks-on-chip (NoCs) and nonuniform cache architectures (NUCA).
  4. Our experimental results show that the power values estimated by the proposed model closely match those derived by simulation for each benchmark.

The rest of this paper is organized as follows. A brief background on traditional and NVM technology is explained in Section 2. Section 3 describes the related work. Section 4 analyzes the power consumption of core and uncore components in multicore processors. Section 5 explains the target heterogeneous 3D CMP architecture used in this work. Section 6 presents the power model for the target 3D CMP with the stacked cache hierarchy. In Section 7, evaluation results are presented. Finally, Section 8 concludes the paper.

2 Background

Since the proposed power model can be used for both homogeneous and heterogeneous stacked cache layers, we first compare the characteristics of different traditional and non-volatile memory technologies. Then, we review STT-RAM as a well-known NVM technology.

The traditional, high-performance SRAM technology has been widely used in on-chip caches due to its standard-logic compatibility, high endurance, and fast access time [20]. However, low-density SRAM dissipates high leakage power through its six-transistor implementation [21] and has become a bottleneck for energy-efficient designs. With the increasing demand for larger memories in computing systems, conventional SRAM-based caches become more expensive. DRAM technology has become a viable alternative for implementing on-chip caches due to its high density, high capacity, low leakage, and high write endurance. Stacking low-leakage, high-density DRAM as an on-die cache makes large, reliable last-level caches with high bandwidth possible. However, conventional eDRAM tends to be slow compared with SRAM and consumes a significant amount of refresh energy to retain stored data, which has a negative impact on performance.

Compared with traditional memory technologies such as SRAM and eDRAM, NVM technologies such as STT-RAM and PRAM commonly offer many desirable features, such as near-zero leakage power, non-volatility, high cell density, and high resilience against soft errors. Among these characteristics, the most important feature that makes NVM technologies suitable for combating the recent dark silicon challenge is near-zero leakage power. As shown in Fig. 1, due to the magnetic nature of the MTJ in NVM memory cells, there is no leakage path between the source line and the bit line; therefore, the static power consumption is near zero. However, NVMs suffer from shortcomings such as a limited number of write operations, long write latency, and high write energy. Compared with the other technologies, PRAM is too slow for low-level caches but can be used as a large last-level cache. Table 1 provides a brief comparison of SRAM, STT-RAM, eDRAM, and PRAM at 32 nm. The parameter values in this table were estimated with NVSim [22] and CACTI [23].
Fig. 1

(a) A parallel MTJ. (b) An antiparallel MTJ. (c) An STT-RAM cell

Table 1

Different memory technology comparisons at 32 nm

Technology    | Area (mm²) | Read latency (ns) | Write latency (ns) | Leakage power at 80°C (mW) | Read energy (nJ) | Write energy (nJ)
1 MB SRAM     | 3.03       | 0.702             | 0.702              | 444.6                      | 0.168            | 0.168
4 MB eDRAM    | 3.31       | 1.26              | 1.26               | 386.8                      | 0.142            | 0.142
4 MB STT-RAM  | 3.39       | 0.880             | 10.67              | 190.5                      | 0.278            | 0.765
16 MB PRAM    | 3.47       | 1.760             | 43.7               | 210.3                      | 0.446            | 0.705

In this section, STT-RAM, a well-known NVM technology shown in Fig. 1, is briefly explained.

As shown in Fig. 1c, an STT-RAM cell stores bit information in a magnetic tunnel junction (MTJ). An MTJ, the fundamental building block of this NVM technology, consists of two ferromagnetic layers separated by a dielectric layer. While the magnetization direction of one ferromagnetic layer (the reference layer) is fixed, the other (free) layer can be switched by passing a large enough current through the MTJ. If this current exceeds the critical value, the magnetization directions of the two layers become antiparallel and the MTJ is in its high-resistance state, indicating logic “1” (Fig. 1b); otherwise, the magnetization directions of the two layers are parallel and the MTJ is in its low-resistance state, indicating logic “0” (Fig. 1a). Note that the state of the MTJ depends not only on the current intensity but also on the current direction: if electrons flow from the reference layer to the free layer, the magnetic momenta become parallel, resulting in low resistance (logic “0”); if electrons flow in the reverse direction, the momenta become antiparallel, resulting in high resistance (logic “1”).

3 Related work

Most prior low-power techniques focus on power management at the processor level, and the only knob they use to control the power of a multicore processor is the voltage/frequency level of the cores [24, 25]. A number of studies have proposed proactive techniques such as thread scheduling, thread mapping, shutdown schemes, and migration policies to reduce the power consumption of multicore systems [26–30]. However, these approaches limit their scope to the cores.

Managing the recently identified dark silicon problem, which stems from the limited power budget, is a new challenge in future multicore designs [4, 14–16, 31]. To address the challenges of dark silicon, Esmaeilzadeh et al. [4] focused on using only general-purpose cores. They ignored the power impact of uncore components but acknowledged this as a limitation of their work. The research in [14, 15] addresses the synthesis of heterogeneous CMPs to extract better energy efficiency and performance in the dark silicon era. Turakhia et al. [14] proposed a design-time framework for the synthesis of heterogeneous dark silicon CMPs. Raghunathan et al. [15] exploited process variation to select a more suitable subset of cores for an application under a given fixed dark silicon power budget to maximize performance. Venkatesh et al. [16] introduced the concept of “conservation cores”: specialized processors that focus on reducing energy instead of increasing performance, used for computations that cannot take advantage of hardware acceleration. All of these prior works [4, 14–16] on the dark silicon phenomenon over the past six years focus on core rather than uncore components. Dorostkar et al. [31] proposed an optimization problem to minimize the energy consumption of uncore components in a heterogeneous cache hierarchy and 3D NoC under a power budget constraint.

Today, providing analytical models for future multi/many-core CMPs in the presence of dark silicon is essential [6]. None of the previous studies have presented analytical models for future CMPs. To the best of our knowledge, this is the first work that proposes an accurate power model formulating the power consumption of 3D CMPs with stacked cache layers. Unlike previous research on power management techniques and dark silicon, which considers only the portion of power consumption related to on-chip cores [4, 14–16], the proposed model considers the power impact of uncore components as important contributors to the total CMP power consumption in parallel with the cores. This accurate power model can help researchers propose new power management techniques for future CMPs.

In addition, we note that all the power budgeting and performance optimization techniques under a given power budget proposed so far for multicore systems [25–27, 32–34] focus only on multi-programed workloads, where each thread is a separate application. These models are inappropriate for multi-threaded applications. With the increasing parallelization of applications from emerging domains such as recognition, mining, synthesis, and, particularly, mobile applications, this issue has become important: in future many-core architectures, workloads are expected to be multi-threaded. To the best of our knowledge, this is the first study that presents an accurate power model for both multi-programed and multi-threaded workloads.

An analytical power model is therefore essential to verify that power budgets are met by the different parts of a CMP, including cores and uncore components built in different technologies and employing different performance or low-power techniques, and to model the power consumption of heterogeneous and homogeneous future CMPs running both multi-threaded and multi-programed applications.

4 The contribution of core and uncore components in total future multicore processor power consumption

In this section, we analyze the power consumption of core and uncore components in multicore systems. To better understand the power distribution of a multicore processor, we use McPAT [17] to evaluate the power dissipation of core and uncore components, including the L2/L3 cache levels, the routers and links of the NoC, the integrated memory controllers, and the integrated I/O controllers.

In recent years, more and more applications have shifted from being compute-bound to data-bound; therefore, a hierarchy of cache levels and data storage components is required to efficiently store and manipulate large amounts of data. In this context, an increasing percentage of on-chip transistors is devoted to uncore components, and architects have dramatically increased the size of the cache levels in the hierarchy in an attempt to bridge the gap between fast cores and slow off-chip memory accesses in multi/many-core CMPs. We select canneal as a representative of future memory-intensive applications in Fig. 2.
Fig. 2

Power breakdown of (a) a 4-core with 4 MB LLC, (b) an 8-core with 8 MB LLC, (c) a 16-core with 16 MB LLC, and (d) a 32-core with 32 MB LLC under a limited power budget

Figure 2 illustrates the power breakdown of a multicore system with an increasing number of cores under a limited power budget. Cores in this platform are based on the Niagara 2 processor [35], with an additional shared L3 as the last-level cache (LLC).

The size of the LLC increases with the number of cores, as shown in Fig. 2. The multicore systems in this experiment run the canneal application from PARSEC [36], and we use 32-nm technology. As shown in the figure, the power consumption of uncore components becomes more critical as the number of cores increases and the power budget becomes a design limitation. In this work, we assume idle cores can be gated off (dark silicon) while other on-chip resources stay active or idle under the limited power budget; in fact, the uncore components remain active and consume power as long as there is an active core on the chip. As illustrated in Fig. 2, more than half of the power consumption is due to uncore components in the 16-core and 32-core systems.

In addition, Fig. 3 illustrates that when technology scales from 32 to 22 nm, the ratio of leakage power increases and is expected to exceed the dynamic power in future generations. We use a 1-GHz frequency and a 0.9-V supply voltage for an 8-core system in the 32- and 22-nm technologies in Fig. 3. The figure shows that leakage power dominates the power budget in nanoscale technologies and is a major driver of the unusable portion, or dark silicon, in future many-core CMPs.
Fig. 3

Dynamic power vs. leakage power for an 8-core system in 32- and 22-nm technologies

In this context, emerging technologies such as NVMs, with their near-zero leakage power, and three-dimensional integrated circuits (3D ICs), which allow different technologies to be stacked onto CMOS circuits, bring new opportunities for architecting new classes of low-power multi/many-core systems in the dark silicon era.

5 Target CMP architecture

With the increasing parallelism of new applications (from emerging domains such as recognition, mining, synthesis, and especially mobile applications), which can efficiently use 100 to 1000 cores, designs have shifted toward multi/many-core architectures in recent years. Due to the scalability limitations and performance degradation of 2D CMPs, especially in future many-cores, this work focuses on 3D integration to reduce global wire lengths and improve the performance of future CMPs. For instance, Apple’s iPhone 4S used the A5 processor, an SoC with two LPDDR2 SDRAM chips stacked on top of the core layer [37].

The architecture model assumed in this work is based on a 3D CMP with a multi-level hybrid cache hierarchy stacked on the core layer, similar to Fig. 4a. As shown in Fig. 4a, each cache level is assumed to be implemented in a different memory technology. To motivate the proposed architecture for future CMPs, we consider the following scenario: a 3D CMP with a homogeneous cache hierarchy, as shown in Fig. 4b, with one layer per level stacked on the core layer and four cores in the core layer, each running the art application from SPEC 2000/2006 [38]. Figure 4b illustrates an example of the proposed architecture of Fig. 4a with four homogeneous cache levels in the hierarchy and a four-core layer, with more detail about the on-chip interconnection. Table 2 reports the average memory access time (AMAT), a suitable parameter for evaluating cache system performance, and the system power consumption when the stacked cache levels of the homogeneous hierarchy are made from SRAM, eDRAM, STT-RAM, or PRAM. Note that each metric in Table 2 is normalized: power consumption with respect to SRAM and AMAT with respect to PRAM. Based on these observations, SRAM is the fastest but most power-hungry option, so it is best used in the lower levels of the cache hierarchy to support faster accesses. Accordingly, we use SRAM in the L2 cache level, eDRAM in L3, STT-RAM in L4, and PRAM in L5, as shown in Fig. 4a. Details of the experimental setup and the power and performance estimation used in this motivating example are given in Section 7.
Fig. 4

(a) The target CMP architecture. (b) A 3D CMP architecture with a homogeneous cache hierarchy

Table 2

Comparison of AMAT and system power consumption

Technology | AMAT | Power consumption
SRAM       | 0.09 | 1
eDRAM      | 0.16 | 0.62
STT-RAM    | 0.3  | 0.37
PRAM       | 1    | 0.22

Because of the strong thermal correlation between a core and the cache banks stacked directly on it, the core and the cache banks in the same stack are called a core set in our architecture (as shown in Fig. 5a).
Fig. 5

Use of the cache hierarchy in (a) a multi-programed workload and (b) a multi-threaded workload

6 Proposed power model for NoC-based CMPs with stacked cache hierarchy

In this section, we present an analytical power model for the 3D chip-multiprocessors (CMPs) with stacked cache hierarchy as future CMPs. Table 3 lists the parameters used in this model.
Table 3

Parameters used in the power model

Parameter | Definition
n | Number of cores in the core layer
f_i | Operating frequency of core i
P_i^core | Power consumption of core i
P_{D,i} | Dynamic power consumption of core i
P_{L,i} | Leakage power consumption of core i
P_i^cache_hierarchy | Sum of the power consumed by the cache banks dedicated to core i in each level of the stacked cache hierarchy, from the 1st to the kth level
P_static_k(T) | Static power consumed by each layer of the kth cache level (L_k) at temperature T °C
N | Number of cache levels L_1, L_2, …, L_N
c_k | Capacity of the kth cache level (L_k)
b_{i,k} | Number of active cache layers in region-set bank i stacked on core i at the kth cache level
B_{i,k} | Accumulated cache capacity in region-set bank i stacked on core i at the kth cache level
a^r, a^w | Number of read and write accesses of an application
HiT_k | Hit time per hit access at the kth cache level
APPH_k | Average power consumption per hit access at the kth cache level
γ | Number of accesses per second
α | Sensitivity coefficient from the cache-miss power law
E_n | Data sharing factor [53]
T_s | Total execution time of the mapped applications
E^s_interconnection | Energy consumption of the interconnection between nodes in T_s
P_interconnection | Power consumption of the interconnection network between nodes
P^q_{n,n′,n′′} | Static power consumption of a mesh-based interconnection network with n nodes in dimension 1, n′ nodes in dimension 2, and n′′ nodes in dimension 3
P^static_Links | Static power consumption of the links
P^static_TSVs | Static power consumption of the TSVs
E^s_NP | Average total energy dissipated in the on-chip interconnection network for transferring NP packets in T_s
P^qC_R | Static power consumption of a router (without any packet)
P^c_R | Static power consumption of a router with one virtual channel (without any packet)
P^c_Link | Static power consumption of a link (without any packet)
P^c_TSV | Static power consumption of a TSV (without any packet)
TSV | Total number of TSVs
E^s_1 | Average total energy dissipated in transferring one packet from source to destination in the on-chip interconnection network
E^P_R | Average constant energy dissipated in a router and its link for a packet transfer
E^f_R | Average constant energy dissipated in a router and its link for a flit transfer
D_mesh | Average distance of the mesh topology (the average number of links a packet traverses from source to destination)
v | Number of virtual channels per link
l | Size of a packet in flits

The total power consumption of a CMP mainly comes from three on-chip resources: the cores, the cache hierarchy, and the interconnection network. CMPs with a large number of cores (more than eight) require architectures built on a scalable network-on-chip (NoC).

6.1 Components of the total power consumption of a 3D CMP

The total power consumption of a 3D CMP can be calculated as the sum of the power of individual on-chip resources (core and uncore components).
$$ {P}_{\mathrm{Total}}={P}_{\mathrm{cores}}+{P}_{\mathrm{uncores}} $$
(1)
$$ {P}_{\mathrm{Total}}={P}_{\mathrm{cores}}+{P}_{\mathrm{cache}\_\mathrm{hierarchy}}+{P}_{\mathrm{interconnection}} $$
(2)

6.1.1 Modeling of core power consumption

We denote the power consumption of core i as \( {P}_i^{\mathrm{core}} \).
$$ {P}_{\mathrm{core}\mathrm{s}}=\sum \limits_{i=1}^n{P}_i^{\mathrm{core}} $$
(3)
The power consumption of core i comprises dynamic and leakage components. The total power consumption of core i is written as:
$$ {P}_i^{\mathrm{core}}={P}_{D,i}+{P}_{L,i},\qquad \forall i $$
(4)
$$ {P}_{D,i}={P}_{\mathrm{max}}\frac{f_i^2}{f_{\mathrm{max}}^2},\qquad \forall i $$
(5)

Since the operating voltage of a core depends on its operating frequency, it is assumed that the square of the voltage scales linearly with the operating frequency [39]. In Eq. 5, Pmax is the maximum power budget and fmax is the maximum frequency of the core.

The leakage power dissipation depends on temperature. The leakage power of core i can be written as Eq. 6, where Tt is the ambient temperature at time t and hi is an empirical coefficient for temperature-dependent leakage power dissipation. The hi coefficients of cores with the same microarchitecture have the same value; hi is determined by the thermal behavior of a core and is calculated as presented in [40, 41].
$$ {P}_{L,i}={h}_i\times {T}_t,\qquad \forall i,t $$
(6)
In this work, for core power modeling, we consider the peak leakage power, as in other works [14, 15]; therefore, the model uses the maximum sustainable chip temperature.
$$ {P}_{L,i}={h}_i\times {T}_{\mathrm{max}},\qquad \forall i $$
(7)
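As a minimal sketch, the core power component of the model (Eqs. 3–5 and 7) can be computed as follows; all numeric parameter values here are illustrative assumptions, not values from the paper’s experiments.

```python
def core_dynamic_power(p_max, f_i, f_max):
    # Eq. 5: P_D,i = P_max * f_i^2 / f_max^2 (V^2 assumed linear in f)
    return p_max * (f_i ** 2) / (f_max ** 2)

def core_leakage_power(h_i, t_max):
    # Eq. 7: peak leakage power at the maximum sustainable temperature
    return h_i * t_max

def total_core_power(cores, t_max):
    # Eqs. 3 and 4: sum of dynamic + leakage power over all n cores
    return sum(core_dynamic_power(c["p_max"], c["f"], c["f_max"])
               + core_leakage_power(c["h"], t_max)
               for c in cores)

# Four identical cores (assumed values): 2 W budget each, running at half speed
cores = [{"p_max": 2.0, "f": 1.0e9, "f_max": 2.0e9, "h": 0.005}] * 4
print(total_core_power(cores, t_max=80.0))  # total core power in watts
```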

6.1.2 Modeling of cache hierarchy power consumption

Cache hierarchy power consumption modeling for multi-programed workloads

As shown in Fig. 4a, the number of cache levels is N and each cache level is indexed as Lk (k = 1, 2, 3, …, N). There are Mk layers in the kth cache level, Lk. The lth cache layer (l = 1, 2, 3, …, Mk) in Lk is denoted Ak, l.

We assume that in multi-programed workloads, each application mapped to a core effectively sees only its own slice of the dedicated cache banks in the cache hierarchy.
$$ {P}_{\mathrm{cache}\_\mathrm{hierarchy}}={\sum}_{i=1}^n{P}_i^{\mathrm{cache}\_\mathrm{hierarchy}} $$
(8)
$$ {P}_i^{\mathrm{cache}\_\mathrm{hierarchy}}={P}_{{\mathrm{dynamic}}_i}^{\mathrm{cache}\_\mathrm{hierarchy}}+{P}_{{\mathrm{static}}_i}^{\mathrm{cache}\_\mathrm{hierarchy}} $$
(9)
The first part of Eq. 9, \( {P}_{{\mathrm{dynamic}}_i}^{\mathrm{cache}\_\mathrm{hierarchy}} \), depends on the dynamic energy. The dynamic energy consumed by a cache depends on the average memory access time (AMAT): reducing AMAT lowers the cache dynamic energy. Therefore, to formulate the first part of Eq. 9 in terms of accessible model variables, we first model the AMAT. The AMAT for a cache hierarchy with N levels is given by Eq. 10; it is a function of the miss rate and access time of each cache level.
$$ \mathrm{AMAT}={HiT}_1+\sum \limits_{k=1}^{N-1}{HiT}_{k+1}\times {R}_k^{\mathrm{miss}} $$
(10)
where HiTk denotes the hit time at the kth cache level and \( {R}_k^{\mathrm{miss}} \) is the product of the cache miss rates from the 1st to the kth cache level. Because read and write access times differ in non-volatile memories (i.e., STT-RAM- or PRAM-based caches), the average HiTk at the kth cache level is computed as Eq. 11:
$$ {HiT}_k=\frac{a_k^r\times {\tau}_k^r+{a}_k^w\times {\tau}_k^w}{a_k^r+{a}_k^w} $$
(11)

where \( {a}_k^r \) and \( {a}_k^w \) are the number of read and write accesses of the running program at the kth cache level, respectively. \( {\tau}_k^r \) and \( {\tau}_k^w \) are latencies of read and write at the kth cache level.
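Equations 10 and 11 can be sketched in a few lines; the access counts, latencies, and miss rate below are illustrative assumptions (loosely in the spirit of the Table 1 values), not the paper’s measured parameters.

```python
def hit_time(a_r, a_w, tau_r, tau_w):
    # Eq. 11: access-weighted hit time; reads and writes may differ (e.g., STT-RAM)
    return (a_r * tau_r + a_w * tau_w) / (a_r + a_w)

def amat(hit_times, level_miss_rates):
    # Eq. 10: AMAT = HiT_1 + sum_{k=1}^{N-1} HiT_{k+1} * R_k^miss,
    # where R_k^miss is the product of the miss rates of levels 1..k.
    total, r_miss = hit_times[0], 1.0
    for k in range(len(hit_times) - 1):
        r_miss *= level_miss_rates[k]
        total += hit_times[k + 1] * r_miss
    return total

# Two-level example: symmetric SRAM-like level over an STT-RAM-like level
h1 = hit_time(a_r=900, a_w=100, tau_r=0.7, tau_w=0.7)    # ns
h2 = hit_time(a_r=900, a_w=100, tau_r=0.9, tau_w=10.7)   # slow writes
print(amat([h1, h2], level_miss_rates=[0.2]))            # AMAT in ns
```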

Following the same approach, we can compute the average power per access (APPA) by:
$$ \mathrm{APPA}={APPH}_1+\sum \limits_{k=1}^{N-1}{APPH}_{k+1}\times {R}_k^{\mathrm{miss}} $$
(12)
$$ {APPH}_k=\frac{a^r\times {\tau}_k^r\times {p}_k^r+{a}^w\times {\tau}_k^w\times {p}_k^w}{a^r+{a}^w} $$
(13)
where \( {p}_k^r \) and \( {p}_k^w \) are the power consumption of read and write operations at the kth cache level, respectively. We can rewrite Eq. 13 in terms of energies as:
$$ {APPH}_k=\frac{a_k^r\times {E_{\mathrm{read}}}_k+{a}_k^w\times {E_{\mathrm{write}}}_k}{a_k^r+{a}_k^w} $$
(14)
where Ereadk and Ewritek are read and write energy at the kth cache level, respectively.
$$ {R}_k^{\mathrm{miss}}=\mu \times {\left(\frac{B_k}{\sigma}\right)}^{-\alpha } $$
(15)
where σ is the baseline cache size, μ is the baseline cache miss rate, and α is the power-law exponent, which typically lies between 0.3 and 0.7 [42]. Bk is the sum of the allocated cache capacity from the 1st to the kth cache level and is obtained by:
$$ {B}_k=\sum \limits_{m=1}^k{c}_m\times {b}_m $$
(16)

where cm and bm are the capacity of each cache layer and the number of active cache layers at the mth cache level, respectively.
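The power-law miss rate of Eqs. 15 and 16 can be sketched as follows; the baseline miss rate, baseline size, exponent, and layer capacities are assumed values for illustration only.

```python
def accumulated_capacity(layer_caps, active_layers, k):
    # Eq. 16: B_k = sum_{m=1}^{k} c_m * b_m
    return sum(c * b for c, b in zip(layer_caps[:k], active_layers[:k]))

def miss_rate(b_k, mu=0.05, sigma=1.0, alpha=0.5):
    # Eq. 15: R_k^miss = mu * (B_k / sigma)^(-alpha), alpha typically in [0.3, 0.7]
    return mu * (b_k / sigma) ** (-alpha)

caps = [1.0, 4.0, 4.0]   # assumed capacity (MB) of one layer at each level
active = [1, 2, 1]       # assumed number of active layers per level
b3 = accumulated_capacity(caps, active, k=3)
print(b3, miss_rate(b3))  # accumulated capacity and its miss rate
```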

We can rewrite the first part of Eq. 9, \( {P}_{{\mathrm{dynamic}}_i}^{\mathrm{cache}\_\mathrm{hierarchy}} \), based on the accessible variables as:
$$ {P}_{{\mathrm{dynamic}}_i}^{\mathrm{cache}\_\mathrm{hierarchy}}=\gamma \times \left({APPH}_1+\sum \limits_{k=1}^{N-1}{APPH}_{k+1}\times \mu \times {\left(\frac{B_{i,k}}{\sigma}\right)}^{-\alpha}\right) $$
(17)
where γ is the number of cache accesses per second. In Eq. 18, di is a constraint that gives the time-to-deadline of the program allocated to core i.
$$ \gamma =\frac{a^r+{a}^w}{d_i} $$
(18)
As a worst case, we can assume that all accesses of the mapped application go to the Nth cache level of the hierarchy, which has the largest latency. Therefore, we can set di as:
$$ {d}_i={a}^r\times {\tau}_N^r+{a}^w\times {\tau}_N^w $$
(19)
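Putting Eqs. 17–19 together, the dynamic part of the cache-hierarchy power can be sketched as below; the APPH values, access counts, and last-level latencies are illustrative assumptions.

```python
def gamma(a_r, a_w, tau_r_N, tau_w_N):
    # Eqs. 18-19: accesses per second, with d_i set to the worst case
    # in which every access goes to the slowest (Nth) cache level.
    d_i = a_r * tau_r_N + a_w * tau_w_N
    return (a_r + a_w) / d_i

def dynamic_cache_power(g, apph, b_caps, mu=0.05, sigma=1.0, alpha=0.5):
    # Eq. 17: gamma * (APPH_1 + sum_k APPH_{k+1} * mu * (B_{i,k}/sigma)^(-alpha))
    total = apph[0]
    for k in range(len(apph) - 1):
        total += apph[k + 1] * mu * (b_caps[k] / sigma) ** (-alpha)
    return g * total

# Assumed workload: 9000 reads, 1000 writes, PRAM-like last-level latencies (s)
g = gamma(a_r=9000, a_w=1000, tau_r_N=1.76e-9, tau_w_N=43.7e-9)
print(dynamic_cache_power(g, apph=[0.17e-9, 0.3e-9], b_caps=[4.0]))  # watts
```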
The second part of Eq. 9, \( {P}_{{\mathrm{static}}_i}^{\mathrm{cache}\_\mathrm{hierarchy}} \), is the total leakage power consumption of the cache banks dedicated to core i, which is the main contributor to the total power consumption.
$$ {P}_{{\mathrm{static}}_i}^{\mathrm{cache}\_\mathrm{hierarchy}}=\sum \limits_{k=1}^N{b}_{i,k}\times {P}_{{\mathrm{static}}_k}\left({T}_{\mathrm{max}}\right) $$
(20)
$$ \kern1.75em {b}_{i,k}=\frac{B_{i,k}-{B}_{i,k-1}}{c_k} $$
(21)

In Eq. 20, \( {P}_{{\mathrm{static}}_k}\left({T}_{\mathrm{max}}\right) \) is the static power consumed by each layer of the kth cache level, Lk, at temperature Tmax. Equation 21 shows the number of active cache layers in the region set bank i stacked on core i at the kth cache level, bi, k, that is proportional to the difference between accumulated cache capacity at the kth cache level, Bi, k, and that at the (k − 1)th level, Bi, k − 1. ck shows the capacity of the kth cache level, Lk.
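Equations 20 and 21 amount to counting the active layers at each level and weighting them by per-layer leakage. A sketch with hypothetical leakage values (the numbers are made up for illustration):

```python
def static_cache_power(B, caps, p_static_layer):
    """Eqs. 20-21: leakage of the cache layers dedicated to one core.
    B[k]              -- accumulated capacity B_{i,k} up to level k+1
    caps[k]           -- capacity c_k of one layer at level k+1
    p_static_layer[k] -- leakage of one layer at level k+1 at T_max
    """
    total, prev = 0.0, 0.0
    for B_k, c_k, p_k in zip(B, caps, p_static_layer):
        b_ik = (B_k - prev) / c_k   # active layers at this level, Eq. 21
        total += b_ik * p_k         # Eq. 20
        prev = B_k
    return total
```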

Cache hierarchy power consumption modeling for multi-threaded workloads

Equations 8–21 model cache power consumption for multi-programed workloads, in which each program uses only the cache banks dedicated to its own core set privately, as shown in Fig. 5a. A large class of multi-threaded applications is based on barrier synchronization and consists of two phases of execution (shown in Fig. 6): a sequential phase, which consists of a single thread of execution, and a parallel phase, in which multiple threads process data in parallel. The parallel threads of execution in a parallel phase typically synchronize on a barrier: all threads must finish execution before the application can proceed to the next phase. In multi-threaded workloads, cache levels are shared across the threads. In the parallel phase, threads share regions at each layer of the cache levels in the hierarchy, as shown in Fig. 5b. For example, for a performance-maximization problem under a power budget, we first dedicate region 1 in each level to the threads as an initial value. Then, based on the power budget and the other constraints of the optimization problem, we can increase the number of regions in each level or keep it fixed in order to obtain the maximum performance.
Fig. 6

An example of a multi-threaded application with D parallel threads

Since multi-threaded applications share the cache hierarchy, we can rewrite Eq. 9 for them as follows:
$$ {P}_{\mathrm{cache}\_\mathrm{hierarchy}}={\mathrm{P}}_{\mathrm{dynamic}}^{\mathrm{cache}\_\mathrm{hierarchy}}+{P}_{\mathrm{static}}^{\mathrm{cache}\_\mathrm{hierarchy}} $$
(22)
Because of the impact of multi-threaded data sharing on the cache miss rate, we replace Eq. 15 with Eq. 23:
$$ {R}_{j,k}^{\mathrm{miss}}=\mu \times {\left(\frac{B_{j,k}}{n\times \sigma}\right)}^{-\alpha}\times {E}_n $$
(23)
where En models the impact of data sharing on the miss rate and n is the number of cores in the core layer. μ and σ are the same parameters as in Eq. 15 [42].
$$ {B}_k=\sum \limits_{m=1}^k\sum \limits_{j=1}^{regn_m}j\times {x}_{j,m}\times \frac{c_m}{regn_m} $$
(24)
$$ \sum \limits_{j=1}^{regn_k}{x}_{j,k}=1,\kern0.75em \forall k $$
(25)

Let xj,k ∈ {0, 1}, ∀ j ∈ [1, regnk], ∀ k ∈ [1, N], be a binary variable. xj,k is set to 1 when the multi-threaded application uses j regions (region 1, region 2, …, region j − 1, and region j) at the kth cache level. Its optimal value is found by solving the performance-maximization problem. Note that regnk represents the total number of regions at the kth cache level of the hierarchy. Equation 25 guarantees that the value of xj,k is unique, so exactly j regions are used at the kth cache level.
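The one-hot region selection of Eqs. 24 and 25 can be mimicked directly: picking a single j per level contributes j regions' worth of that level's capacity. A sketch with illustrative capacities and region counts (not values from the paper):

```python
def region_capacity(j_choice, caps, regn):
    """Eqs. 24-25 sketch: j_choice[k] is the single j with x[j,k] = 1,
    i.e., the number of regions used at level k+1; caps[k] is c_k and
    regn[k] is regn_k. Returns the accumulated capacities B_k."""
    B, acc = [], 0.0
    for j, c, r in zip(j_choice, caps, regn):
        assert 1 <= j <= r   # Eq. 25: exactly one valid j per level
        acc += j * c / r     # j regions out of regn_k, each of size c_k/regn_k
        B.append(acc)
    return B

# 2 of 4 regions at a 16 MB level, then 1 of 4 at a 64 MB level:
print(region_capacity([2, 1], [16, 64], [4, 4]))  # [8.0, 24.0]
```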

The first part of Eq. 22, \( {P}_{\mathrm{dynamic}}^{\mathrm{cache}\_\mathrm{hierarchy}} \), based on accessible variables, is as follows:
$$ {P}_{\mathrm{dynamic}}^{\mathrm{cache}\_\mathrm{hierarchy}}=\gamma \times \left({APPH}_1+\sum \limits_{k=1}^{N-1}\sum \limits_{j=1}^{regn_k}\left({APPH}_{k+1}\times \mu {\left(\frac{B_k}{n\times \sigma}\right)}^{-\alpha}\times {E}_n\ \right)\right) $$
(26)

where γ is the number of accesses per second in Eq. 26 and is computed by using Eqs. 18 and 19.

The second part of Eq. 22, \( {P}_{\mathrm{static}}^{\mathrm{cache}\_\mathrm{hierarchy}} \), can be modeled as follows:
$$ {P}_{\mathrm{static}}^{\mathrm{cache}\_\mathrm{hierarchy}}=\sum \limits_{k=1}^N\sum \limits_{j=1}^{regn_k}j\times {x}_{j,k}\times {P}_{{\mathrm{static}}_{regk}}\left({T}_{\mathrm{max}}\right) $$
(27)

Equations 12, 13, and 14 from the multi-programed cache power modeling are reused unchanged for multi-threaded workloads.

6.1.3 Modeling of on-chip interconnection power consumption

Energy consumption of the on-chip interconnection network over the total execution time of the mapped workload, Ts, is calculated by Eq. 28 [43], which comprises the static energy of the interconnection network and the dynamic energy of transferring packets through the network.
$$ {E}_{\mathrm{interconnection}}^s={E}_{\mathrm{static}}+{E}_{\mathrm{dynamic}}={P}_{n,{n}^{\prime },{n}^{\prime \prime}}^q\times {T}_s+{E}_{NP}^s $$
(28)
$$ {\displaystyle \begin{array}{l}{E}_{NP}^s= NP\times {E}_1^s= NP\times \left({D}_{\mathrm{mesh}}+1\right)\times {E}_R^P\\ {}\kern1.74em = NP\times \left({D}_{\mathrm{mesh}}+1\right)\times l\times {E}_R^f\end{array}} $$
(29)

The total dynamic energy dissipation is the energy dissipated for transferring NP packets, where each packet dissipates \( {E}_1^s \) on average while traveling from source to destination in the on-chip interconnection network. When a packet is forwarded from source to destination, on average it crosses Dmesh + 1 routers and links (\( {E}_R^P \) is the average energy dissipated in a router and its associated link per packet transfer). A packet contains l flits, and in this context \( {E}_R^f \) is the average energy dissipated in a router and its associated link per flit transfer. Therefore, the dynamic energy dissipation \( \left({E}_{NP}^s\right) \) for transferring NP packets during Ts is formulated as Eq. 29.
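Eq. 29 is a product of packet count, hop count, packet length, and per-flit energy. A direct sketch (the values used below are assumptions for illustration only):

```python
def noc_dynamic_energy(NP, D_mesh, l_flits, E_R_f):
    """Eq. 29: NP packets each cross D_mesh + 1 routers/links;
    each packet carries l_flits flits, and E_R_f is the average
    router+link energy per flit."""
    return NP * (D_mesh + 1) * l_flits * E_R_f
```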

In a mesh topology with d dimensions, where there are ki nodes in the ith dimension, the average distance that a packet must traverse to reach the destination can be calculated by Eq. 30 [44]:
$$ \kern1.75em {D}_{\mathrm{mesh}}=\frac{1}{3}\times {\sum}_{i=1}^d\left({k}_i-\frac{1}{k_i}\right) $$
(30)
In a 2D mesh with n nodes in each dimension (d = 2 and k1 = k2 = n), the average distance between two nodes can be calculated as follows:
$$ \kern1.75em {D}_{\mathrm{mesh}}=\frac{2n}{3}-\frac{2}{3n} $$
(31)
In a many-core platform based on a 2D mesh topology (n ≥ 32), the second term in Eq. 31 is negligible and can be ignored. Therefore, the average distance is:
$$ \kern1.75em {D}_{\mathrm{mesh}}\cong \frac{2n}{3} $$
(32)
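Eqs. 30–32 are easy to check numerically; the sketch below computes the exact average distance and confirms that for a 32 × 32 mesh the 2n/3 approximation of Eq. 32 is within about 0.1%:

```python
def d_mesh(dims):
    """Eq. 30: average hop distance of a mesh with k_i nodes per dimension."""
    return sum(k - 1.0 / k for k in dims) / 3.0

exact = d_mesh([32, 32])             # Eq. 31: 2n/3 - 2/(3n) with n = 32
approx = 2 * 32 / 3.0                # Eq. 32
print(abs(exact - approx) / exact)   # relative error below 0.001
```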

\( {P}_{n,{n}^{\prime },{n}^{\prime \prime}}^q \) is the static power consumption of an interconnection network based on a mesh topology with n nodes in dimension 1, n′ nodes in dimension 2, and n′′ nodes in dimension 3; it comprises the power consumption of TSVs, links, and routers in the absence of packets. There are n × n′ × n′′ routers with v virtual channels, n × n′ links on the core layer, and TSV TSVs in the 3D network-on-chip.

Finally, power consumption of on-chip interconnection between nodes can be calculated as:
$$ {\displaystyle \begin{array}{l}{P}_{\mathrm{interconnection}}=\frac{E_{\mathrm{interconnection}}^s}{T_s}={P}_{n,{n}^{\hbox{'}},{n}^{\hbox{'}\hbox{'}}}^q+\frac{E_{NP}^s}{T_s}\kern2.25em \\ {}=\left(n\times {n}^{\hbox{'}}\times {n}^{\hbox{'}\hbox{'}}\times {P}_R^{qC}+{P}_{\mathrm{Links}}^{\mathrm{static}}+{P}_{TSV s}^{\mathrm{static}}\right)+\frac{E_{NP}^s}{T_s}\\ {}=n\times {n}^{\hbox{'}}\times {n}^{\hbox{'}\hbox{'}}\times \nu \times {P}_R^c+n\times {n}^{\hbox{'}}\times {P}_{\mathrm{link}}^c+ TSV\times {P}_{TSV}^c+\frac{E_{NP}^s}{T_s}\end{array}} $$
(33)
Since Eq. 33 is a function of the total execution time of the mapped applications, Ts, and Ts is large compared to \( {E}_{NP}^s \), the last term of Eq. 33 can be ignored; therefore,
$$ {P}_{\mathrm{interconnection}}=n\times {n}^{\prime}\times {n}^{\prime \prime}\times \nu \times {P}_R^c+n\times {n}^{\prime}\times {P}_{link}^c+ TSV\times {P}_{TSV}^c $$
(34)

As described in [45–47], and based on the observation from Fig. 3, leakage power is particularly problematic for NoC structures because it is dissipated regardless of communication activity. At high network utilization, static power may comprise more than 75% of the total NoC power at the 22-nm technology node, and this percentage is expected to increase in future technology generations. This fact is captured by Eq. 34.
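Eq. 34 is a weighted sum of component counts and per-component leakage. A sketch using the 4 × 4 mesh with three layers of Table 4 but hypothetical per-component leakage values (in W):

```python
def noc_static_power(n1, n2, n3, v, p_router_vc, n_links, p_link,
                     n_tsv, p_tsv):
    """Eq. 34: router leakage scales with routers x virtual channels;
    link and TSV leakage scale with their respective counts."""
    return n1 * n2 * n3 * v * p_router_vc + n_links * p_link + n_tsv * p_tsv
```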

6.2 Dark silicon constraint

Equations 35 and 36 represent the dark silicon constraints for CMPs running multi-programed and multi-threaded workloads, respectively, when, for example, the goal is to maximize system performance. Maximizing performance under a power constraint is an important target in digital system design today. The peak power dissipation during the entire execution must be less than the maximum power budget.
$$ \sum \limits_{i=1}^n{P}_i^{\mathrm{core}}+\sum \limits_{i=1}^n{P}_i^{\mathrm{cache}\_\mathrm{hierarchy}}+{P}_{\mathrm{interconnection}}\le {P}_{\mathrm{budget}} $$
(35)
$$ \sum \limits_{i=1}^n{P}_i^{\mathrm{core}}+\sum \limits_{k=1}^N\sum \limits_{j=1}^{regn_k}j\times {x}_{j,k}\times {P}_j^{\mathrm{cache}\_\mathrm{hierarchy}}+{P}_{\mathrm{interconnection}}\le {P}_{\mathrm{budget}} $$
(36)

Equations 35 and 36 can be used in design-time and run-time optimization techniques and other power management methods to combat dark silicon.
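The feasibility check of Eq. 35 is a single inequality per candidate configuration; an optimizer simply discards configurations that fail it. A sketch using the 110 W budget of Table 4 with hypothetical per-component powers:

```python
def within_budget(p_cores, p_caches, p_noc, p_budget):
    """Eq. 35: total core + cache hierarchy + interconnect power
    must not exceed the dark-silicon power budget."""
    return sum(p_cores) + sum(p_caches) + p_noc <= p_budget

# 16 cores at 3 W each, 16 cache stacks at 2.5 W each, 8 W of NoC power:
print(within_budget([3.0] * 16, [2.5] * 16, 8.0, 110.0))  # True
```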

The proposed model is linear, since all formulas are linear and every variable has degree one. To solve the model, we use CVX [48], an efficient convex optimization solver. The model can be solved exhaustively in polynomial time to determine the best solution that maximizes performance within the dark silicon peak power budget. The overall runtime overhead of this computation was negligible in our experiments.

7 Experimental evaluation

7.1 Platform setup

In order to validate the efficiency of the 3D CMP architectures in this work, we employed a detailed simulation framework driven by traces extracted from real application workloads running on a full-system simulator. The traces were extracted from the gem5 full-system simulator [18]. For simulating a 3D CMP architecture, the extracted traces were interfaced with 3D Noxim, a 3D NoC simulator [49]. gem5 was augmented with McPAT [17], and 3D Noxim with ORION [50], to calculate the power consumption in this paper. The cache capacities and the energy consumption of SRAM and NVMs were estimated with CACTI [23] and NVSim [22], respectively. A full-system simulation of a 16-core CMP architecture with three cache levels in the hierarchy at the 32-nm technology node is performed for evaluation in this work. Each cache level of the stacked hierarchy consists of three layers. In the hybrid architecture, each layer of the L2 cache comprises 16 × 1 MB SRAM banks, each layer of the L3 cache comprises 16 × 4 MB eDRAM banks, and each layer of the L4 cache comprises 16 × 4 MB STT-RAM banks. In the baseline architecture, each layer of the L2, L3, and L4 caches comprises 16 × 1 MB SRAM banks. The detailed properties of cache banks in different technologies are listed in Table 1. The system configuration used for evaluation in this work is listed in Table 4.
Table 4 Specification of CMP configurations evaluated in this work

| Component | Description |
|---|---|
| Number of cores | Experiment 1: 16, 4 × 4 mesh; Experiment 2: 64, 8 × 8 mesh |
| Core configuration | Alpha 21164, 3 GHz, area 3.5 mm², 32 nm |
| L1 cache | SRAM, 4-way, 32 B line, 32 KB private per core |
| L2/L3/L4 caches | Baseline: L2 SRAM, L3 SRAM, L4 SRAM. Hybrid: L2 SRAM, L3 eDRAM, L4 STT-RAM |
| Network router | 2-stage wormhole switched, XYZ routing, virtual channel flow control, 2 VCs per port, buffer depth of 5 flits per VC, 8 flits per data packet, 1 flit per address packet, each flit 16 bytes long |
| Network topology | 3D network, each layer a 4 × 4 mesh, each node in layer 1 has a router, 16 TSV links (128-bit bi-directional) in each layer |
| Pmax, Tmax | 110 W, 80 °C |

We use multi-programed workloads consisting of 16 applications and multi-threaded workloads with 16 threads in our experiments. The applications in the multi-programed workloads are selected from the SPEC2000/2006 benchmark suites [38]. Based on the memory demand intensity of the benchmark applications, we classified them into three groups: memory-bounded, medium, and computation-bounded. From this classification, we generated a range of workloads (combinations of 16 benchmarks), as summarized in Table 5; the number in parentheses is the number of instances. In our setup, programs in a given workload are randomly mapped to cores to avoid a specific OS policy. Table 6 summarizes the multi-threaded workloads, based on PARSEC [51], used in this work.
Table 5 Multi-programed workloads used in the experiment

| Test program suite | Benchmarks |
|---|---|
| Memory Bounded set1 (MB1) | zeusmp(3), libquantum(3), lbm(3), GemsFDTD(3), art(2), swim(2) |
| Memory Bounded set2 (MB2) | zeusmp(2), libquantum(2), lbm(2), GemsFDTD(2), art(4), swim(4) |
| Medium set1 (MD1) | mcf(3), sphinx3(3), leslie3d(2), gcc(2), cactusADM(2), milc(2), omnetpp(2) |
| Medium set2 (MD2) | mcf(2), sphinx3(2), bzip2(2), calculix(2), leslie3d(2), gcc(2), cactusADM, milc, omnetpp, wupwise |
| Computation Bounded set1 (CB1) | parser(2), applu(2), face_rec(2), equake(2), astar(2), hmmer(2), bzip2(2), calculix(2) |
| Computation Bounded set2 (CB2) | parser(2), applu(2), face_rec(2), equake(2), astar(2), hmmer(2), bzip2, calculix, mpeg_dec(2) |
| Mixed set1 (Mix1) | sphinx3(2), mcf, astar(2), hmmer, gamess(2), perlbench(2), soplex, gromacs, gcc(2), leslie3d(2) |
| Mixed set2 (Mix2) | sphinx3, mcf, astar(2), hmmer(2), gamess(2), perlbench(2), gromacs(2), tonto(2), gcc, leslie3d |

Table 6 Multi-threaded workloads used in the experiment

| Multi-threaded workload |
|---|
| blackscholes, bodytrack, canneal, dedup, facesim, swaptions, ferret, fluidanimate, vips, freqmine, x264 |

7.2 Experimental results

In this subsection, we evaluate the target 3D CMP with a stacked cache hierarchy in two different configurations: the CMP with SRAM-only stacked cache levels on the core layer (baseline) and the proposed CMP with hybrid stacked cache levels on the core layer (hybrid). Results extracted from the simulations are compared with those obtained from the analytical model. We define a new parameter, LLC_Util, which shows the utilization of the last-level cache in the hierarchy and best characterizes the workloads among the parameters we considered. Workloads with LLC_Util below 5% are computation intensive. Tables 7 and 8 show the characteristics of the workloads used, based on the LLC_Util parameter, the number of cache hits, the number of cache misses, packet latency, and the number of packets transferred. When the utilization of the last-level cache of the hierarchy is high (LLC_Util > 5%), the cache miss rate increases and the workload needs a larger cache capacity to better fit its active memory footprint.
Table 7 Workload characteristics (cache hierarchy)

| WL | LLC_Util (%) | Miss rate (%) | Hit rate (%) | WL | LLC_Util (%) | Miss rate (%) | Hit rate (%) |
|---|---|---|---|---|---|---|---|
| blackscholes | 0.14 | 5 | 95 | MB1 | 97 | 79 | 21 |
| bodytrack | 0.15 | 8 | 92 | MB2 | 92 | 73 | 27 |
| canneal | 74 | 73 | 27 | MD1 | 62.4 | 61 | 39 |
| dedup | 25.3 | 61 | 39 | MD2 | 58.6 | 57 | 43 |
| facesim | 10.7 | 57 | 43 | CB1 | 3.6 | 4 | 96 |
| ferret | 9.1 | 55 | 45 | CB2 | 2 | 3 | 97 |
| swaptions | 0.16 | 7 | 93 | Mix1 | 7.8 | 9 | 91 |
| fluidanimate | 27.6 | 59 | 41 | Mix2 | 14.3 | 17 | 83 |
| freqmine | 0.72 | 41 | 59 | | | | |
| vips | 5 | 33 | 77 | | | | |
| x264 | 5.3 | 30 | 70 | | | | |
Table 8 Workload characteristics (NoC)

| WL | Packet latency (ns) | No. of packet transfer | WL | Packet latency (ns) | No. of packet transfer |
|---|---|---|---|---|---|
| blackscholes | 0.01 | 1,000,000 | MB1 | 0.98 | 11,000,000 |
| bodytrack | 0.016 | 1,150,000 | MB2 | 0.92 | 10,000,000 |
| canneal | 0.61 | 9,700,000 | MD1 | 0.67 | 7,400,000 |
| dedup | 0.52 | 6,000,000 | MD2 | 0.65 | 7,100,000 |
| facesim | 0.49 | 5,300,000 | CB1 | 0.12 | 2,000,100 |
| ferret | 0.39 | 4,880,005 | CB2 | 0.11 | 1,900,000 |
| swaptions | 0.013 | 1,000,010 | Mix1 | 0.31 | 2,700,000 |
| fluidanimate | 0.59 | 8,000,400 | Mix2 | 0.34 | 3,000,050 |
| freqmine | 0.29 | 3,000,000 | | | |
| vips | 0.21 | 2,700,000 | | | |
| x264 | 0.20 | 2,870,000 | | | |

For the SPEC benchmarks, we fast-forward 500M instructions and run in detailed mode for the next 1 billion instructions. For the PARSEC benchmarks, we run 1 billion instructions starting from the region of interest (ROI), using the simlarge input set. We used Ruby in gem5, which accounts for core stalls and the blocking-time effect when generating traces for the workloads. The proposed model also considers stalls and packet blocking time, following the stall-time and blocking-effect modeling of recent studies [43, 52].

Figures 7 and 8 compare the power consumption obtained from simulation and from the analytical model for the baseline and proposed architectures under multi-threaded and multi-programed workloads, respectively.
Fig. 7

Validation of the power model under multi-threaded workloads for a 16-core CMP and b 64-core CMP

Fig. 8

Validation of the power model under multi-programed workloads for a 16-core CMP and b 64-core CMP

According to Table 7, the canneal and MB1 workloads, with the largest LLC_Util, are memory-intensive and utilize the last-level cache heavily. For these workloads, as shown in Figs. 7a and 8a, the cache hierarchy consumes more power than the cores because the cores are mostly stalled. The swaptions and CB2 workloads, with the smallest LLC_Util, are computation-intensive and, as shown in Figs. 7a and 8a, have higher core power consumption than the other workloads. Compared with the baseline CMP, the proposed hybrid CMP improves the power consumption of cores, cache hierarchy, and on-chip interconnection by about 6.3, 22.5, and 5.0% on average under multi-threaded workloads. Under multi-programed workloads, the hybrid CMP improves core and cache hierarchy power consumption by about 6.14 and 24.14%, respectively, and worsens the on-chip interconnection power consumption by a negligible 0.14% on average.

Next, we evaluate the scalability of the proposed model on a 64-core CMP, as shown in Figs. 7b and 8b. As the number of cores increases, the growth in power consumption is much higher for memory-intensive workloads than for computation-intensive ones, due to higher uncore power consumption, in both multi-threaded and multi-programed workloads. It should be noted that the power consumption of the target architecture is limited by a designer-specified power budget and temperature limit.

As shown in Figs. 7 and 8, the proposed power model estimates the power consumption of heterogeneous (hybrid) and homogeneous (baseline) 3D CMPs with a good degree of accuracy under both multi-programed and multi-threaded workloads. Tables 9 and 10 show the differences between the simulation and the proposed model for the multi-threaded and multi-programed workloads on a 16-core CMP. To evaluate the degree of accuracy, we calculate the standard deviation (STDEV) between the simulation and the proposed model across the different benchmarks and architectures. As reported in Table 11, the values of the proposed model are very close to those of the simulation. For core power estimation under multi-threaded workloads on a 16-core CMP, the STDEV is negligible: about 0.0153 for the baseline CMP and 0.0198 for the hybrid CMP. In addition, the cache hierarchy power consumption is estimated with STDEVs of 0.0472 and 0.0159 for the baseline CMP, and of about 0.0345 and 0.0364 for the hybrid CMP, under the multi-threaded and multi-programed workloads, respectively.
Table 9 Difference of simulation and proposed model under multi-threaded workloads for a 16-core CMP

| Workload | CMP | Cores (%) | Cache hierarchy (%) | NoC (%) |
|---|---|---|---|---|
| blackscholes | Baseline | 4.225 | 0.714 | 6.452 |
| | Hybrid | 3.077 | 3.279 | 1.754 |
| bodytrack | Baseline | 4.348 | 4.412 | 3.226 |
| | Hybrid | 3.077 | 3.636 | 3.448 |
| canneal | Baseline | 5.000 | 9.091 | 3.333 |
| | Hybrid | 4.762 | 14.151 | 0.000 |
| dedup | Baseline | 6.977 | 1.282 | 0.000 |
| | Hybrid | 7.692 | 3.846 | 0.000 |
| facesim | Baseline | 3.077 | 3.077 | 3.333 |
| | Hybrid | 1.639 | 3.704 | 3.571 |
| ferret | Baseline | 1.991 | 0.915 | 3.333 |
| | Hybrid | 1.613 | 3.774 | 3.448 |
| swaptions | Baseline | 4.360 | 4.425 | 3.226 |
| | Hybrid | 2.326 | 1.852 | 0.000 |
| fluidanimate | Baseline | 4.878 | 6.410 | 3.333 |
| | Hybrid | 5.263 | 5.455 | 3.448 |
| freqmine | Baseline | 2.985 | 2.941 | 3.333 |
| | Hybrid | 1.563 | 3.704 | 3.333 |
| vips | Baseline | 5.172 | 16.923 | 3.333 |
| | Hybrid | 1.818 | 1.852 | 3.448 |
| x264 | Baseline | 1.695 | 2.899 | 3.448 |
| | Hybrid | 1.754 | 1.852 | 0.000 |
| Average | Baseline | 3.726 | 4.424 | 3.029 |
| | Hybrid | 3.144 | 4.282 | 2.041 |

Table 10 Difference of simulation and proposed model under multi-programed workloads for a 16-core CMP

| Workload | CMP | Cores (%) | Cache hierarchy (%) | NoC (%) |
|---|---|---|---|---|
| MB1 | Baseline | 0.084 | 6.871 | 20.929 |
| | Hybrid | 4.545 | 11.111 | 16.667 |
| MB2 | Baseline | 3.357 | 5.852 | 16.847 |
| | Hybrid | 5.149 | 13.725 | 12.903 |
| MD1 | Baseline | 3.774 | 4.755 | 2.069 |
| | Hybrid | 2.449 | 9.222 | 4.643 |
| MD2 | Baseline | 3.846 | 5.797 | 3.226 |
| | Hybrid | 5.263 | 7.018 | 6.452 |
| CB1 | Baseline | 3.333 | 3.279 | 6.250 |
| | Hybrid | 1.754 | 8.333 | 3.448 |
| CB2 | Baseline | 3.226 | 6.154 | 3.030 |
| | Hybrid | 1.695 | 6.000 | − 3.125 |
| Mix1 | Baseline | 3.509 | 2.941 | 6.667 |
| | Hybrid | 1.852 | 5.769 | − 3.226 |
| Mix2 | Baseline | 3.448 | 2.899 | 6.452 |
| | Hybrid | 9.091 | 1.754 | 3.333 |
| Average | Baseline | 3.072 | 4.819 | 8.184 |
| | Hybrid | 3.975 | 7.867 | 5.137 |

Table 11 Standard deviation of the simulation and proposed model

| Workload | CMP | Cores | Cache hierarchy | NoC |
|---|---|---|---|---|
| Multi-threaded (16 cores) | Baseline | 0.0153 | 0.0472 | 0.0144 |
| | Hybrid | 0.0198 | 0.0345 | 0.0169 |
| Multi-threaded (64 cores) | Baseline | 0.0653 | 0.0471 | 0.0336 |
| | Hybrid | 0.0393 | 0.0311 | 0.0435 |
| Multi-program (16 cores) | Baseline | 0.0123 | 0.0159 | 0.0692 |
| | Hybrid | 0.0258 | 0.0364 | 0.0696 |
| Multi-program (64 cores) | Baseline | 0.0123 | 0.0108 | 0.0250 |
| | Hybrid | 0.0258 | 0.0272 | 0.0435 |

8 Conclusions

In this work, we proposed an accurate power model that formulates the power consumption of 3D CMPs with stacked cache layers. The proposed model, which for the first time considers the power impact of uncore components alongside the cores, is suitable for both heterogeneous and homogeneous CMPs under multi-threaded and multi-programed workloads. In the future, this model can be used in optimization problems to minimize power consumption or maximize performance of CMPs under latency and temperature constraints. Moreover, the power model can be used in the prediction functions of machine-learning-based power/thermal management techniques for future power-aware CMPs.

Declarations

Authors’ contributions

AA, the first author of this paper, proposed the presented power model for future embedded chip-multiprocessors and implemented the platform setup for this work under the supervision of Professor FM, the third author. Professor FM participated in the evaluation of the proposed model and helped to draft the manuscript. AD, the second author, helped to prepare the simulation results of this work. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1)
Electrical and Computer Engineering Department, Ryerson University, 350 Victoria Street, Toronto, Ontario, M5B 2K3, Canada

References

  1. Kao, J, Narendra, S, Chandrakasan, A (2002). Subthreshold leakage modeling and reduction techniques. In IEEE/ACM Int. Conf. Comput.-aided design (ICCAD) (pp. 141–148).Google Scholar
  2. Kim, NS, et al. (2003). Leakage current: Moore’s law meets static power. Computer, 36(12), 68–75.View ArticleGoogle Scholar
  3. Wang, W, & Mishra, P. (2012). System-wide leakage-aware energy minimization using dynamic voltage scaling and cache reconfiguration in multitasking systems. IEEE Trans. Very Large Scale Integr. VLSI Syst., 20(5), 902–910.View ArticleGoogle Scholar
  4. Esmaeilzadeh, H, Blem, E, St. Amant, R, Sankaralingam, K, Burger, D (2011). Dark silicon and the end of multicore scaling. In Proc. Int. Symp. Comput. Archit. (pp. 365–376).Google Scholar
  5. Henkel, J, Khdr, H, Pagani, S, & Shafique, M (2015). New trends in dark silicon. In Proc. Design Automation Conf. (DAC), (pp. 1–6).Google Scholar
  6. Bose, P. (2013). Is dark silicon real?: technical perspective. Commun. ACM, 56(2), 92.View ArticleGoogle Scholar
  7. Taylor, MB (2012). Is dark silicon useful?: harnessing the four horsemen of the coming dark silicon apocalypse. In Proc. Design Autom. Conf. (DAC), (pp. 1131–1136).Google Scholar
  8. Jammy, R (2009). Materials, process and integration options for emerging technologies. In SEMATECH/ISMI symposium.Google Scholar
  9. Woo, DH, Seong, NH, Lewis, DL, Lee, H-HS (2010). An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth. In Int. Symp. High Perf. Comput. Arch. (HPCA), (pp. 1–12).Google Scholar
  10. Loh, GH (2008). 3D-Stacked Memory Architectures for Multi-core Processors. In Int. Symp. Comput. Arch. (ISCA), (pp. 453–464).Google Scholar
  11. Kultursay, E, Kandemir, M, Sivasubramaniam, A, Mutlu, O (2013). Evaluating STT-RAM as an energy-efficient main memory alternative. In Int. Symp. Performance Analysis of Systems and Software (ISPASS), (pp. 256–267).Google Scholar
  12. Lee, BC, et al. (2010). Phase-change technology and the future of main memory. IEEE Micro, 30(1), 143–143.View ArticleGoogle Scholar
  13. Diao, Z, et al. (2007). Spin-transfer torque switching in magnetic tunnel junctions and spin-transfer torque random access memory. J. Phys. Condens. Matter, 19(16), 165–209.Google Scholar
  14. Turakhia, Y, Raghunathan, B, Garg, S, Marculescu, D (2013). HaDeS: architectural synthesis for heterogeneous dark silicon chip multi-processors. In Design Autom. Conf. (DAC), (p. 1).Google Scholar
  15. Raghunathan, B, Turakhia, Y, Garg, S, Marculescu, D (2013). Cherry-picking: exploiting process variations in dark-silicon homogeneous chip multi-processors. In Design, Autom. Test in Europe Conf. & Exhibition (DATE), (pp. 39–44).Google Scholar
  16. Venkatesh, G, Sampson, J, Goulding-Hotta, N, Venkata, SK, Taylor, MB, Swanson, S (2011). QsCores: trading dark silicon for scalable energy efficiency with quasi-specific cores. In Proc. Int. Symp. Microarchitecture, (p. 163).Google Scholar
  17. Li, S, Ahn, JH, Strong, RD, Brockman, JB, Tullsen, DM, Jouppi, NP (2009). McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. Int. Sym. Microarchitecture, (pp. 469–480).Google Scholar
  18. Binkert, N, et al. (2011). The gem5 simulator. ACM SIGARCH Comput. Archit. News, 39(2), 1.View ArticleGoogle Scholar
  19. Poremba, M, Zhang, T, Xie, Y. (2015). NVMain 2.0: a user-friendly memory simulator to model (non-)volatile memory systems. IEEE Comput. Archit. Lett., 14(2), 140–143.View ArticleGoogle Scholar
  20. Chang M-T, Rosenfeld, P, Lu S-L, Jacob, B (2013). Technology comparison for large last-level caches (L3Cs): Low-leakage SRAM, low write-energy STT-RAM, and refresh-optimized eDRAM. In Int. Symp. High Perfor. Comput. Archit. (HPCA), (pp. 143–154).Google Scholar
  21. Chen, Y-T, et al. (2012). Dynamically reconfigurable hybrid cache: An energy-efficient last-level cache design. In Design, Autom. Test in Europe Conf. Exhib. (DATE), (pp. 45–50).Google Scholar
  22. Dong, X, Xu, C, Jouppi, N, Xie, Y (2014). NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Non-volatile Memory. In Y. Xie (Ed.), Emerging Memory Technologies, (pp. 15–50).Google Scholar
  23. Muralimanohar, N, Balasubramonian, R, Jouppi, NP. (2009). CACTI 6.0: a tool to model large caches. HP Lab., 22–31.Google Scholar
  24. Wonyoung Kim, Gupta, M-S, Wei, G-Y, Brooks, D (2008). System level analysis of fast, per-core DVFS using on-chip switching regulators. In Int. Symp. High Perfor. Comput. Archit. (HPCA), (pp. 123–134).Google Scholar
  25. Wang, Y, Ma, K, Wang, X. (2009). Temperature-constrained power control for chip multiprocessors with online model estimation. ACM SIGARCH Comput. Archit. News, 37(3), 314.MathSciNetView ArticleGoogle Scholar
  26. Heo, S, Barr, K, Asanović, K (2003). Reducing power density through activity migration. In Int. Symp. Low Power Electronics Design, (p. 217).Google Scholar
  27. Chantem, T, Hu, XS, Dick, RP. (2011). Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs. IEEE Trans. Very Large Scale Integr. VLSI Syst., 19(10), 1884–1897.View ArticleGoogle Scholar
  28. Ebi, T, Al Faruque, MA, Henkel, J (2009). TAPE: thermal-aware agent-based power economy for multi/many-core architectures. In IEEE/ACM Int. Conf. Comput.-Aided Design-Digest of Technical Papers, (p. 302).Google Scholar
  29. Ebi, T, Kramer, D, Karl, W, Henkel, J (2011). Economic learning for thermal-aware power budgeting in many-core architectures. In Proc. Int. Conf. Hardware/Software Codesign and Syst. Synthesis (CODES+ISSS), (p. 189).Google Scholar
  30. Al Faruque, MA, Jahn, J, Ebi, T, Henkel, J. (2010). Runtime thermal management using software agents for multi- and many-core architectures. IEEE Des. Test Comput., 27(6), 58–68.View ArticleGoogle Scholar
  31. Dorostkar, A, Asad, A, Fathy, M, Mohammadi, F (2017). Optimal Placement of Heterogeneous Uncore Component in 3D Chip-Multiprocessors. In Euromicro Conf. Digital System Design (DSD), (pp. 547–551).Google Scholar
  32. Shelepov, D, et al. (2009). HASS: a scheduler for heterogeneous multicore systems. ACM SIGOPS Oper. Syst. Rev, 43(2), 66.View ArticleGoogle Scholar
  33. Asad, A, Ozturk, O, Fathy, M, & Jahed-Motlagh, MR. (2015). Exploiting Heterogeneity in Cache Hierarchy in Dark-Silicon 3D Chip Multi-processors. In Euromicro Conf. Digital Syst. Design (DSD) (pp. 314–321).Google Scholar
  34. Sharifi, A, Mishra, AK, Srikantaiah, S, Kandemir, M, Das, CR (2012). PEPON: performance-aware hierarchical power budgeting for NoC based multicores. In Proc. Int. Conf. Parallel archit. compilation techn. (p. 65).Google Scholar
  35. Nawathe, UG, Hassan, M, Yen, KC, Kumar, A, Ramachandran, A, Greenhill, D. (2008). Implementation of an 8-Core, 64-thread, power-efficient SPARC server on a chip. IEEE J. Solid State Circuits, 43(1), 6–20.View ArticleGoogle Scholar
  36. Gebhart, M, Hestness, J, Fatehi, E, Gratz, P, & Keckler, SW. (2009). “Running PARSEC 2.1 on M5”. Univ. Texas at Austin, Dept. of Comput. Sci., Tech. Rep.Google Scholar
  37. Shimpi, AL, & Klug, B. (2011, Oct.). Apple iPhone 4S: thoroughly reviewed. [Online]. Available: http://www.anandtech.com/show/4971/apple-iphone-4s-review-att-verizon/5.
  38. Standard Performance Evaluation Corporation. [Online]. Available: http://www.specbench.org.
  39. Murali, S, Coenen, M, Radulescu, A, Goossens, K, & De Micheli, G (2006). Mapping and configuration methods for multi-use-case networks on chips. In Proc. Asia South Pacific Design Autom. Conf. (ASP-DAC), (pp. 146–151).
  40. Skadron, K, Stan, MR, Sankaranarayanan, K, Huang, W, Velusamy, S, & Tarjan, D. (2004). Temperature-aware microarchitecture: modeling and implementation. ACM Trans. Archit. Code Optim., 1(1), 94–125.
  41. Su, H, Liu, F, Devgan, A, Acar, E, & Nassif, S (2003). Full chip leakage estimation considering power supply and temperature variations. In Proc. Int. Symp. Low Power Electronics Design (ISLPED), (pp. 78–83).
  42. Hartstein, A, Srinivasan, V, Puzak, TR, & Emma, PG (2006). Cache miss behavior: is it √2? In Proc. Conf. Computing Frontiers (CF), (pp. 313–320).
  43. Asad, A, Zonouz, AE, Seyrafi, M, Soryani, M, & Fathy, M (2009). Modeling and analyzing of blocking time effects on power consumption in network-on-chips. In Int. Conf. Reconfig. Computing FPGAs (ReConFig), (pp. 290–295).
  44. Shen, Z. (2002). The calculation of average distance in mesh structures. Comput. Math. Appl., 44(10–11), 1379–1402.
  45. Zhan, J, Xie, Y, & Sun, G (2014). NoC-sprinting: interconnect for fine-grained sprinting in the dark silicon era. In ACM/EDAC/IEEE Design Autom. Conf. (DAC), (pp. 1–6).
  46. Zhan, J, Ouyang, J, Ge, F, Zhao, J, & Xie, Y (2015). DimNoC: a dim silicon approach towards power-efficient on-chip network. In ACM/EDAC/IEEE Design Autom. Conf. (DAC), (pp. 1–6).
  47. Sun, C, et al. (2012). DSENT – a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In IEEE/ACM Int. Symp. Netw. on Chip (NoCS), (pp. 201–210).
  48. Grant, M, Boyd, S, & Ye, Y. CVX: Matlab software for disciplined convex programming. [Online]. Available: http://cvxr.com/cvx/.
  49. Catania, V, Mineo, A, Monteleone, S, Palesi, M, & Patti, D (2015). Noxim: an open, extensible and cycle-accurate network on chip simulator. In Int. Conf. Application-specific Syst., Archit. Processors (ASAP), (pp. 162–163).
  50. Kahng, AB, Lin, B, & Nath, S. (2015). ORION3.0: a comprehensive NoC router estimation tool. IEEE Embed. Syst. Lett., 7(2), 41–45.
  51. Bienia, C, Kumar, S, Singh, JP, & Li, K (2008). The PARSEC benchmark suite: characterization and architectural implications. In Int. Conf. Parallel Archit. Compilation Techn. (PACT), (p. 72).
  52. Ogras, UY, & Marculescu, R (2007). Analytical router modeling for networks-on-chip performance analysis. In Design, Autom. Test Euro. Conf. Exhib. (DATE), (pp. 1–6).
  53. Krishna, A, Samih, A, & Solihin, Y. (2012). Data sharing in multi-threaded applications and its impact on chip design. In Int. Symp. Perform. Analysis Syst. Software (ISPASS), (pp. 125–134).

Copyright

© The Author(s). 2018
