Dual-access way-prediction cache for embedded systems
© Chu and Park; licensee Springer. 2014
Received: 17 July 2013
Accepted: 26 April 2014
Published: 20 May 2014
Way-prediction (WP) caches reduce power consumption and latency for highly associative data caches and are thus favorable for embedded systems. In this paper, we propose an enhanced way-prediction cache, the dual-access way-prediction (DAWP) cache, to address the weakness of the WP cache. The prediction logic designed for the DAWP cache contains a scaled index table, a global history register, and a fully associative cache to achieve higher prediction accuracy, which in turn yields lower energy consumption and latency. We measure performance with a simulation model implemented with SimpleScalar and CACTI, using nine SPEC2000 benchmark programs. Our experimental results show that the proposed DAWP cache is highly efficient in power and latency for highly associative cache structures. The efficiency increases with associativity: results with a 64 KB cache show that the DAWP cache achieves power gains of 16.45% to 75.85% and latency gains of 4.91% to 26.96% for 2-way through 32-way structures, respectively. We also observe that, with the DAWP cache, the random replacement policy yields better power and latency efficiency than the LRU (least recently used) policy.
Over the last decade, low-power system design has been a major concern for embedded systems, especially hand-held devices, which depend on limited battery power. In most embedded systems, the microprocessor and the cache are known to be the major sources of energy consumption. In fact, it is reported that cache memories occupy more than 60% of the microprocessor's die area and consume more than 40% of the total system power [1–5]. Reducing power dissipation in an embedded system therefore calls for an energy-efficient cache system.
In general, on-chip caches used in mobile devices are highly associative, with more than 16 ways per set, to improve performance by reducing costly accesses to memory. The highly associative caches proposed in [6–8] are specifically designed for embedded systems and improve performance by reducing conflict misses, which are caused by imperfect allocation of entries in the cache. However, they significantly increase system power consumption because all the banks in the cache are accessed simultaneously; for example, an n-way set-associative cache has n banks that are accessed at once. A basic method of reducing power consumption in a cache system is to reduce the number of bank accesses that charge the bit-lines of the cache memory [4, 9]. In this paper, we aim to reduce the power consumption of the highly associative caches used in embedded systems by accurately predicting the target bank, i.e., accessing only one of the banks in the cache to reach the referenced data. The resulting cache system is named the dual-access way-prediction (DAWP) cache. It not only saves power but also reduces latency, since the cache is accessed like a direct-mapped cache when the prediction hits.
The rest of this paper is organized as follows. Section 2 gives a brief review of related work. Section 3 describes the proposed dual-access way-prediction cache. Section 4 describes the simulation model and the performance measurement metrics. Section 5 presents experimental results and discussion, and Section 6 concludes the paper.
2. Related work
Researchers have proposed various methods to reduce the power consumption of highly associative cache memories. One popular approach is the phased cache, in which the cache is divided into two parts, i.e., a tag part and a data part [10, 11]. In the phased cache, the tag bits of all tag banks are enabled (powered) and compared against the memory reference. On a hit in a bank, that bank is enabled and accessed during the next cycle. Although the phased cache reduces power consumption to some extent, it has the disadvantage of needing more clock cycles to fetch the desired data than conventional set-associative caches.
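The trade-off of the phased cache can be illustrated with a toy access-count model. The sketch below is only an illustration: the function names, the unit-cost counting of one read per tag or data bank, and the handling of a miss after the tag phase are assumptions made here, not details taken from [10, 11].

```python
def conventional_access(n_ways: int) -> dict:
    """Conventional n-way lookup: all tag and data banks are read in parallel."""
    return {"tag_reads": n_ways, "data_reads": n_ways, "cycles": 1}


def phased_access(n_ways: int, hit: bool) -> dict:
    """Phased lookup: cycle 1 reads all tags, cycle 2 reads only the matching data bank."""
    if hit:
        return {"tag_reads": n_ways, "data_reads": 1, "cycles": 2}
    # Assumption: on a miss, no data bank is enabled after the tag phase.
    return {"tag_reads": n_ways, "data_reads": 0, "cycles": 1}
```

Under these assumptions, a hit in a 4-way phased cache reads one data bank instead of four but takes two cycles instead of one, which is the latency penalty noted above.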
Another popular approach is the way-prediction (WP) cache [10, 12], in which prediction logic is added to a conventional set-associative cache structure, such as a 2-way or 4-way set-associative cache. The prediction logic predicts a way, which is the target bank to be accessed. On a prediction miss, all remaining banks in the cache must be accessed. The way-prediction cache is known to reduce energy consumption more effectively than the phased cache. Although the prediction logic itself consumes some extra energy, the cache system consumes less total energy than the conventional cache. An additional gain from such prediction is reduced latency, since the cache behaves like a direct-mapped cache when a prediction hits. An approach proposed in  shows better power reduction than a way-prediction cache, but it employs complicated cache structures, namely two cache structures (way-prediction and phased) and a multicolumn-based way-prediction mechanism, using 2-way branch prediction techniques.
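The basic way-prediction lookup described above can be sketched for a single cache set as follows. This is a minimal illustration, not the design of [10, 12]: the class name, the method names, and the policy of predicting the most recently hit or filled way are assumptions chosen to keep the example short.

```python
class WayPredictedSet:
    """One set of an n-way cache with a single predicted way (hypothetical policy)."""

    def __init__(self, n_ways: int):
        self.tags = [None] * n_ways  # tag held by each bank for this set
        self.predicted_way = 0       # bank probed first on the next access

    def fill(self, way: int, tag) -> None:
        """Place a block in a bank and predict that bank next (assumed policy)."""
        self.tags[way] = tag
        self.predicted_way = way

    def lookup(self, tag):
        """Return (outcome, banks_accessed) for one reference."""
        n_ways = len(self.tags)
        # First access: probe only the predicted bank, as in a direct-mapped cache.
        if self.tags[self.predicted_way] == tag:
            return "prediction_hit", 1
        # Prediction miss: the remaining n-1 banks must be probed as well.
        for way, stored in enumerate(self.tags):
            if way != self.predicted_way and stored == tag:
                self.predicted_way = way  # remember the bank that actually hit
                return "prediction_miss_cache_hit", n_ways
        return "cache_miss", n_ways
```

A prediction hit touches one bank; every other outcome ends up touching all n banks, which is why prediction accuracy drives both the energy and the latency benefit.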
3. Dual-access way-prediction cache
In this section, we describe our proposed low-power, low-latency cache, named the dual-access way-prediction (DAWP) cache. It is an enhanced way-prediction cache suited to building energy-efficient embedded systems with highly associative cache structures.
The proposed prediction scheme reflects the spatial locality of data. In other words, a sequence of related data tends to be located and accessed in the same bank (way). As we described earlier, the bank information for the prediction is updated via the memory address (index part) and the global history. To reduce the power consumption of the DAWP cache, it is desirable to reduce conflicts in the index table and the fully associative cache. In our design, highly biased accesses are filtered using the global history, and the hit time is reduced because the system accesses only one bank, instead of all n banks, on a prediction hit.
The dual-access prediction mechanism used in the DAWP cache yields high accuracy because it uses both the scaled-up index table and the fully associative cache. Unlike the WP cache, the performance gain of the DAWP cache over the conventional cache increases with the set associativity of the cache.
4. Simulation model and performance measurement metrics
We built a simulator to measure the performance of the proposed DAWP cache. The simulation covers cache sizes of 16, 32, and 64 KB and six cache structures, i.e., direct-mapped, 2-way, 4-way, 8-way, 16-way, and 32-way set-associative caches. The simulation assumes that the cache uses the write-back (WB) mechanism on a write hit and the write-allocation mechanism on a write miss. Our simulation model is based on SimpleScalar and CACTI, and the WP and DAWP cache modules are implemented and ported into the simulator. For performance measurement, we use nine SPEC2000 programs: art, ammp, equake, mesa, mcf, vpr, vortex, gcc, and gzip.
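The parameter space described above can be written down as a small enumeration. The sketch below is illustrative only: it assumes every combination of cache size, structure, and benchmark is simulated, which the text does not state explicitly, and the names used are hypothetical.

```python
from itertools import product

CACHE_SIZES_KB = [16, 32, 64]
ASSOCIATIVITIES = [1, 2, 4, 8, 16, 32]  # 1 stands for direct-mapped
BENCHMARKS = ["art", "ammp", "equake", "mesa", "mcf",
              "vpr", "vortex", "gcc", "gzip"]


def simulation_runs():
    """List one configuration per (size, ways, benchmark) combination.

    Assumption: the full cross product is simulated.
    """
    return [
        {"size_kb": size, "ways": ways, "benchmark": bench}
        for size, ways, bench in product(CACHE_SIZES_KB, ASSOCIATIVITIES, BENCHMARKS)
    ]
```

Under the cross-product assumption this gives 3 x 6 x 9 = 162 configurations per cache type.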
In the simulator, CACTI is used to calculate the memory access latency and power consumption, and we followed the estimation methodologies used in . Other data, namely the prediction hit/miss rate and the cache hit/miss rate, are obtained from the SimpleScalar part of the simulator. To calculate the latency and power consumption, cache accesses are categorized into three cases:
(1) The first case is a correct way-prediction. In this case, the latency and power dissipation are based on the accesses to the index table and the predicted bank, i.e., analogous to the operation of a direct-mapped cache.
(2) The second case is a wrong way-prediction that is still a cache hit. In this case, the way-predictor fails to select the correct bank, but the referenced data is found in a different bank. This requires the index table to be updated with the correct bank information after the data access. The latency and power dissipation are based on an access to the index table, an access to one bank, accesses to all other banks in the cache (e.g., n-1 banks for an n-way set), and an access to update the index table.
(3) The last case is a cache miss, which is the worst case. It consumes time and energy for all the accesses listed in the second case plus a costly access to the lower-level memory to fetch the referenced data.
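The three cases above can be expressed as a simple cost model. The following sketch is a hedged illustration rather than the simulator's actual code: the class and function names are hypothetical, the per-access costs are placeholders that would come from a tool such as CACTI, and the per-bank costs are summed, which suits an energy estimate (for latency, banks probed in parallel would presumably be counted once).

```python
from dataclasses import dataclass


@dataclass
class AccessCosts:
    index_table: float   # one access to (or update of) the index table
    bank: float          # one access to a single cache bank
    lower_memory: float  # one access to the lower-level memory


def access_cost(case: str, n_ways: int, c: AccessCosts) -> float:
    """Cost of one cache access under the three cases described above."""
    if case == "prediction_hit":
        # Case 1: index table plus the predicted bank only.
        return c.index_table + c.bank
    # Case 2: index table, predicted bank, the other n-1 banks, index table update.
    wrong = c.index_table + c.bank + (n_ways - 1) * c.bank + c.index_table
    if case == "prediction_miss_cache_hit":
        return wrong
    if case == "cache_miss":
        # Case 3: everything in case 2 plus the lower-level memory access.
        return wrong + c.lower_memory
    raise ValueError(f"unknown case: {case}")


def average_cost(rates: dict, n_ways: int, c: AccessCosts) -> float:
    """Weight each case by its observed rate (the rates should sum to 1)."""
    return sum(rate * access_cost(case, n_ways, c) for case, rate in rates.items())
```

For example, with placeholder costs of 1 (index table), 10 (bank), and 100 (lower-level memory) in a 4-way cache, the three cases cost 11, 42, and 142 units, so the average falls quickly as the share of correct predictions rises.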
5. Experimental results
From our experiments, we observed that the proposed DAWP cache yields higher way-prediction accuracy than the WP cache for highly associative caches, and considerably higher power and latency efficiency than the conventional cache, for the benchmark programs that we used. We also observed that the power and latency gains become more significant as the set associativity increases.
6. Conclusions
We proposed the dual-access way-prediction (DAWP) cache, an enhanced way-prediction cache, and measured its performance with a simulation model built from SimpleScalar and CACTI. Nine SPEC2000 benchmark programs were used to measure the power and latency efficiency of the DAWP cache.
From our experiments, we observed that the proposed DAWP cache yields higher way-prediction accuracy than the WP cache for highly associative caches, and considerably higher power and latency efficiency than the conventional cache, for the benchmark programs that we used. We also observed that the power and latency gains from using the DAWP cache become more significant as the set associativity increases, and that this effect is stronger for the power gain than for the latency gain. This demonstrates that the proposed DAWP cache is relatively more efficient with highly associative caches. One additional observation from our experiments is that, with the DAWP cache, the random replacement policy yields better performance in both latency and power dissipation than the LRU replacement policy.
- Hennessy J, Patterson DA: Computer Architecture: A Quantitative Approach. Waltham: Morgan Kaufmann, Elsevier; 2012.
- Montanaro J, Witek RT, Anne K, Black AJ, Cooper EM, Dobberpuhl DW, Donahue PM, Eno J, Hoeppner GW, Kruckemyer D, Lee TH, Lin PCM, Madden L, Murray D, Pearce MH, Santhanam S, Snyder KJ, Stephany R, Thierauf SC: A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor. IEEE J. Solid State Circuits 1996, 31(11):1703-1714. doi:10.1109/JSSC.1996.542315
- Flynn MJ, Hung P: Microprocessor design issues: thoughts on the road ahead. IEEE Micro 2005, 25(3):16-31. doi:10.1109/MM.2005.56
- Zhang C: A low power highly associative cache for embedded systems. Proceedings of the IEEE International Conference on Computer Design (ICCD), San Jose, CA, 1–4 October 2007 31-36.
- Alipour M, Moshari K, Bagheri M: Performance per power optimum cache architecture for embedded applications, a design space exploration. Proceedings of the 2nd IEEE International Conference Networked Embedded Systems for Enterprise Applications (NESEA), Fremantle, WA, 8–9 December 2011 1-6.
- Furber SB, Thomas ARP, Oldham HE, Howard DW: ARM3 - 32b RISC processor with 4kbyte on-chip cache. VLSI: Proceedings of the IFIP TC 10/WG 10.5 International Conference on Very Large Scale Integration, Munich, 16–18 August 1989 35-44.
- Santhanam S, Baum AJ, Bertucci D, Braganza M, Broch K, Broch T, Burnette J, Chang E, Chui KT, Dobberpuhl D, Donahue P, Grodstein J, Kim I, Murray D, Pearce M, Silveria A, Soudalay D, Spink A, Stepanian R, Varadharajan A, Wen R: A low-cost, 300-MHz, RISC CPU with attached media processor. IEEE J. Solid State Circuits 1998, 33(11):1829-1839. doi:10.1109/4.726584
- Intel: 3rd Generation Intel Xscale® Microarchitecture–Developer’s Manual. 2007.
- Powell MD, Agarwal A, Vijaykumar TN, Falsafi B, Roy K: Reducing set-associative cache energy via way-prediction and selective direct-mapping. Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO), Austin, 1–5 December 2001 54-65.
- Inoue K, Ishihara T, Murakami K: Way-predicting set-associative cache for high performance and low energy consumption. Proceedings of the 1999 International Symposium on Low Power Electronics and Design, San Diego, 16–17 August 1999 273-275.
- Megalingam RK, Deepu KB, Joseph IP, Vikram V: Phased set associative cache design for reduced power consumption. Proceedings of the 2nd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2009), Beijing, 8–11 August 2009 551-556.
- Batson B, Vijaykumar TN: Reactive-associative caches. Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques, Barcelona, 8–12 September 2001 49-60.
- Zhu Z, Zhang X: Access-mode predictions for low-power cache design. IEEE Micro 2002, 22(2):58-71. doi:10.1109/MM.2002.997880
- Chen H, Chiang J: Low-power way-predicting cache using valid-bit pre-decision for parallel architectures. Proceedings of the 19th IEEE International Conference on Advanced Information Networking and Applications (AINA 2005), Taipei, 28–30 March 2005 203-206.
- Burger DC, Austin TM: The SimpleScalar tool set, version 2.0. Comput. Arch. News 1997, 25(3):13-25. doi:10.1145/268806.268810
- Shivakumar P, Jouppi N: CACTI 3.0: An integrated cache timing, power, and area model, WRL Research Report 2001/2. Palo Alto: Compaq; 2001.
- Contreras G, Martonosi M: Power prediction for Intel XScale® processors using performance monitoring unit events. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, San Diego, 8–10 August 2005 221-226.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.