Use of compiler optimization of software bypassing as a method to improve energy efficiency of exposed data path architectures
© Guzma et al.; licensee Springer. 2013
Received: 10 September 2012
Accepted: 8 April 2013
Published: 10 May 2013
In the design of embedded systems, hardware and software need to be co-explored to meet performance and energy targets. With the use of application-specific instruction-set processors, as a stand-alone solution or as a part of a system on chip, the customization of processors for a particular application is a known method to reduce energy requirements and provide performance. In particular, processor designs with exposed data paths trade compile-time complexity for simplified control hardware and lower running costs. An exposed data path also allows the removal of unused components of the interconnection network once the application is compiled.
In this paper, we propose the use of a compiler technique for processors with exposed data paths, called software bypassing. Software bypassing allows the compiler to schedule data transfers between execution units directly, bypassing the general-purpose register file. This increases scheduling freedom by reducing the dependencies induced by the reuse of registers, decreases the number of read and write accesses to register files, and allows the use of register files with fewer read and write ports, while maintaining or improving performance and preserving reprogrammability. We compare our proposal against an architecture exploration technique, connectivity reduction, which finds all interconnection network components used by the compiled application and removes those that are not used, leading to an energy-efficient application-specific instruction-set processor.
We observe that the use of software bypassing leads to improvements in application speed, with architectures having the smallest number of register file ports consistently outperforming architectures with a larger number of ports, and to a reduction in energy consumption. In contrast, connectivity reduction maintains the same application speed, reduces energy consumption, and allows for an increase in processor frequency; however, with the clock frequency increased to match the performance of software bypassing, energy consumption grows. We also observe that when reprogrammability is not an issue, the most energy-efficient solution is a combination of software bypassing and connectivity reduction.
In an embedded domain, unlike in more traditional high-performance computing, performance closely relates to energy. Efficient solutions utilize the knowledge of application or application domain to explore hardware and software techniques, eventually leading to application-specific instruction-set processors, and to provide enough performance for a particular task, or set of tasks, while minimizing energy requirements. The exploration of available instruction-level parallelism (ILP) and the use of instruction-set extensions are common ways to improve clock cycle performance, leading to lower operation frequencies and lower energy requirements.
When exploring ILP in computation, a large number of general-purpose registers contributes to the increase in performance. Having program variables in independent registers allows data-independent computation paths to be scheduled in parallel, on different execution units. However, in terms of algorithm computation, the time and energy spent on transferring values between function units and register files are wasted, since only the energy spent computing in function units contributes directly to the results.
In addition, when parallel execution requires access to several general-purpose registers in the same cycle, register files need to provide enough read and write ports to allow such access, leading to increased complexity and higher energy requirements of register files.
Another method to increase the performance of embedded processors is the customization of the instruction set [1, 2]. Complex computation patterns can be implemented as custom function units, providing better performance and freeing other processor resources for independent computation paths. This customization, however, often leads to implementations that take a higher number of input values and produce several results, further aggravating the problem of the number of register file ports and of the necessary data transports between function units and register files.
Yet another method to reduce energy requirements of a custom processor is the optimization of the data path. To maintain the best programmability and allow maximum ILP exploitation, processor interconnection networks tend to be designed for the worst case scenario. However, once the application schedule is set, we can simply remove all the unused components of the interconnection network, eventually maintaining only the required connections (connectivity reduction). This has an effect on reduction in interconnection network complexity, instruction fetch and decode energy requirements, and can allow for increase in processor clock frequency. An unfortunate effect of connectivity reduction, however, is limitation to reprogrammability, or no reprogrammability at all, of such a processor. In case the application is modified and needs to be recompiled for the reduced architecture, the compiler may produce inefficient schedule working around missing connections or fail to schedule completely.
In this paper, we propose the use of a compiler optimization technique called software bypassing, suitable for architectures with exposed data paths, as a tool for improving energy efficiency. By allowing the compiler to schedule data transfers directly between function unit outputs and inputs, reading the value of previous computation from the register file becomes unnecessary. This helps reduce unnecessary energy costs of register files. In addition, in the case where all of the uses of produced values can be bypassed directly, the actual write of value to general-purpose register can be discarded during compilation (dead result move elimination), reducing the total number of data transfers on the interconnection network and further contributing to reduction in processor energy consumption.
An additional benefit of this optimization is the reduction of false data dependencies created by register allocation when several program variables reuse the same register to store data. This gives the scheduler more freedom and increases the ILP available during instruction scheduling.
We reason that the combined benefits of software bypassing (reduction in register file reads and writes, reduction in the number of data transfers on the interconnection network, and improved cycle count) match those of connectivity reduction when it comes to energy efficiency while maintaining full reprogrammability.
In situations where reprogrammability is not an issue, we propose the use of a combination of software bypassing and connectivity reduction. We reason that the benefits of these two complement each other, providing for large energy savings. In particular, in order to match the clock cycle improvements gained by the use of software bypassing, a processor with only reduced connectivity needs to run with higher clock frequency, eventually leading to an increase in energy requirements, possibly exceeding energy budget.
In our previous work , we considered a conservative approach to software bypassing and investigated its effect on clock cycle performance and on the reduction in register file reads and writes, depending on the heuristic parameter guiding bypassing decisions. We also discussed the impact of the heuristic deciding when to bypass on register file and interconnection network energy consumption, however, using only a hardware estimator  and a single architecture.
In this paper, we propose a novel bypassing algorithm and use an extensive set of register file architecture types to investigate the effect of bypassing and connectivity reduction on the energy of various processor core components directly influenced by one or both of the methods. In addition, we investigate the cost of matching the performance improvements brought by bypassing via synthesis for higher clock frequency when only connectivity reduction is available. We evaluate our approach using commercially available tools for processor synthesis, gate level simulation, and power analysis.
The rest of this paper is organized as follows: Section 2 discusses previous work. Section 3 gives a short introduction to architectures with exposed data paths and describes our choice, transport-triggered architectures. Section 4 gives an overview of our novel software bypassing algorithm with integrated dead result move elimination, as well as connectivity reduction. Section 5 outlines our experimental setup, and Section 6 provides a discussion of the results of our experiments. Finally, Section 7 concludes this paper.
2 Related work
Effects of bypassing register files are known and appreciated in processor design [5, 6], with reported register file power reduction of 12% on average for Intel XScale processor and performance loss of 2% in  and up to 80% register file energy reduction compared to Reduced Instruction Set Computer (RISC)/Very Long Instruction Word (VLIW) counterparts in . More and more effort is spent in focusing on computation and distancing from the temporary data storage.
The traditional use of register files for storing data becomes a problem with monolithic register files (RF) in VLIW processors with a large number of function units. The requirement of a large number of RF ports in such a case makes the use of a monolithic RF prohibitively expensive. Common solutions involve clustering the RF into a number of smaller ones. Intercluster communication can then be implemented using RF to RF copying and/or reads and writes between dedicated function units and RFs across clusters. However, as shown in [7, 8], using only register to register copies between clusters reduces achievable ILP when compared to a monolithic RF. Results closer to those of a monolithic RF can be achieved with the use of direct reads and writes from a dedicated function unit to RFs in different clusters, suggesting that RF to RF copies between clusters should be avoided when possible.
Another step towards better performance and more achievable ILP is bypassing data directly from function unit to function unit, avoiding the use of RF altogether. Such a solution improves performance and reduces the energy required by RF but can be also used to reduce the number of required RF ports while retaining performance.
The effective use of RF bypassing is dependent on the architecture’s division of work between the software and the hardware. In order to bypass the RF, the compiler or hardware logic must be able to determine what the consumers of the bypassed value are, effectively requiring data flow information, and how the direct operand transfer can be performed in the hardware.
While hardware implementations of RF bypassing may be transparent to a programmer, they also require additional logic and wiring in the processor and can only analyze a limited instruction window for the required data flow information. Hardware implementations of bypassing cannot get the benefit of reduced register pressure since the registers are already allocated to the variables when the program is executing. However, the benefits from the reduced number of RF accesses are achieved. Register renaming  also increases available ILP by the removal of false dependencies. Dynamic strands presented in  are an example of an alternative hardware implementation of RF bypassing. Strands are dynamically detected atomic units of execution where registers can be replaced by direct data transports between operations. In Explicit Data Graph Execution (EDGE) architectures , operations are statically assigned to execution units, but they are scheduled dynamically in a data-flow fashion. Instructions are organized in blocks, and each block specifies its register and memory inputs and outputs. Execution units are arranged in a matrix, and each unit in the matrix is assigned a sequence of operations from the block to be executed. Each operation is annotated with the address of the execution unit to which the result should be sent. Intermediate results are thus transported directly to their destinations.
Static strands in  follow an earlier work  to decrease hardware costs. Strands are found statically during compilation and annotated to pass the information to the hardware. As a result, the number of required registers is reduced already at compile time. This method was, however, applied only to transient operands with a single definition and single use, effectively up to 72% of dynamic integer operands, bypassing about half of them . The authors reported 16% to 24% savings in issue energy, 17% to 20% savings in bypass energy, 13% to 14% savings in register file energy, and 15% improvement in instructions per cycle, using a cycle-accurate simulator for two hardware models: Renesas (formerly Hitachi) SuperH SH4a and IBM PowerPC 750FX embedded microprocessor.
Dataflow mini-graphs  are treated as atomic units by a processor. They have the interface of a single instruction, with intermediate variables alive only in the bypass network. In , architecturally visible ‘virtual registers’ are used to reduce register pressure through bypassing. In this method, a virtual register is only a tag marking data dependence between operations without having physical storage location in the RF.
Software implementations of bypassing analyze code at compile time and pass to the processor the exact information about the sources and the destinations of bypassed data transports, thus avoiding any additional bypassing and analyzing logic in the hardware. This requires an architecture with an exposed data path that allows such direct programming, like the transport-triggered architectures (TTA) [15, 16], synchronous transfer architecture, FlexCore, no-instruction-set-computer, or static pipelining. A commercial application of the TTA paradigm is the Maxim MAXQ general-purpose microcontroller family .
The assignment of destination addresses in an EDGE architecture corresponds to software bypassing in a transport-triggered setting. Software-only bypassing was previously implemented for a TTA architecture using the experimental MOVE framework [22, 23] and MOVE-Pro . TTAs are a special type of VLIW architectures. They allow programs to define explicitly the operations executed in each function unit (FU) as well as to define how (with position in instruction defining bus) and when data are transferred (moved) to each particular port of each unit. With the option of having registers in input and output ports of FUs, TTA allows the scheduler to move operands to FUs in different cycles and read results several cycles after they are computed. Therefore, the limiting factor for bypassing is the availability of connections between the source FU and destination FUs. In our previous work , we introduced a simple, conservative software bypassing implementation. We, however, only focused on the improvements in cycle counts and register file read and write accesses when changing bypass aggressiveness heuristics. In , we discussed the mentioned simple bypassing algorithm in terms of energy only using a hardware cost estimation model  and single architecture.
3 Exposing data paths: transport triggering approach
TTA  is an exposed data path architecture which allows the number of architectural resources to be selected, e.g., the number and size of register files and the number of read and write ports for each register file individually. Similarly, the number of function units as well as the operation set of each function unit can be defined by an architecture designer. To connect them together, the interconnection network is designed, with a choice of the number of buses and sockets to be used. Each socket provides a connection between a function unit or register file port and one or more buses. This allows for fully connected architectures, which give the compiler the most freedom to choose how to transport data between the source and destination, as well as for heavily reduced connectivity, with buses connecting only a small number of components. It is necessary to point out that a complex fully connected interconnection network is expensive and limits the maximum clock frequency. Alternatively, an optimized connectivity could allow for higher frequency; however, it reduces scheduling freedom. With fewer alternatives for transport, potentially parallel data transports often need to be serialized.
Another interesting aspect of TTA comes from its VLIW inheritance. An instruction defines what data transports are to be performed on each bus, which leads to wide instruction words. For each bus in the system, the instruction word encodes the source field of the transport as well as the destination field of the transport. While increasing the number of buses gives the compiler more freedom to schedule, it also increases the instruction width. Reducing the connectivity between sockets and buses typically lowers the number of bits required to encode a data transport for an individual bus and narrows the instruction width. In addition, to significantly reduce the negative impact of the instruction width, instruction compression can be applied. Using dictionary compression [25, 26], for example, the code density can be improved significantly with a decrease in spent energy as well.
4 Software bypassing and connectivity reduction algorithms
In this section, we first discuss our two implementations of software bypassing: First is a conservative approach, previously published in , while the second one is an opportunistic approach, which has not been published previously.
4.1 Software bypassing and dead result move elimination
We can see that individual data transports to operand registers (denoted with the .o suffix) and trigger registers (.t) from the general-purpose registers (rX or RX) are explicitly defined. In the same manner, the data transports from the result register (.r) to general-purpose registers are explicit. It is worth noting that the timing of data transports is not fixed. For instance, register r4 is written to the operand register of add one cycle before register R3 is written to the trigger register of add, which starts the actual computation. Similarly, the operand write of mul occurs two cycles before the trigger write of the same operation starts execution.
We can observe direct data transports from the result registers to operand or trigger registers. Data transports from the result register to the general-purpose register are completely eliminated for both additions (dead result move elimination), as the results are used just once, in the multiplication, where they are transported directly from the result register of the adder. Compared to the first RISC-like example, we can observe that all the registers denoted with capital RX have disappeared, while the amount of work performed by the function units remains the same.
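The transformation described above can be sketched on a toy move list. This is only an illustration under simplifying assumptions, not the scheduler used in this work: moves are hypothetical `(source, destination)` pairs in program order, names without a dot are general-purpose registers, the function name `bypass_moves` and the `live_out` parameter are invented for the example, and timing, bus, and connectivity constraints are ignored.

```python
def bypass_moves(moves, live_out=frozenset()):
    """Toy software bypassing with dead result move elimination (one basic block)."""
    mirror = {}   # register -> result port that currently holds the same value
    pending = {}  # register -> index of a result write not (yet) known to be needed
    out = []
    dead = set()

    for src, dst in moves:
        if "." not in src:                 # source is a general-purpose register
            if src in mirror:
                src = mirror[src]          # bypass: read the result port directly
            else:
                pending.pop(src, None)     # a real register read: keep its write
        if dst.endswith(".t"):             # trigger move starts a new operation
            result_port = dst[:-2] + ".r"
            for reg in [r for r, p in mirror.items() if p == result_port]:
                del mirror[reg]            # that result port will be overwritten
        if "." not in dst:                 # destination is a general-purpose register
            if dst in pending:             # old value overwritten, all reads bypassed
                dead.add(pending.pop(dst))
            mirror.pop(dst, None)
            if src.endswith(".r"):         # result move into the register file
                mirror[dst] = src
                pending[dst] = len(out)
        out.append((src, dst))

    # Result writes whose every use was bypassed are dead unless live outside the block.
    dead.update(i for reg, i in pending.items() if reg not in live_out)
    return [m for i, m in enumerate(out) if i not in dead]


# Hypothetical RISC-like sequence: two additions feeding one multiplication.
program = [
    ("r4", "add.o"), ("r3", "add.t"), ("add.r", "r1"), ("r1", "mul.o"),
    ("r6", "add.o"), ("r5", "add.t"), ("add.r", "r2"), ("r2", "mul.t"),
    ("mul.r", "r7"),
]
for move in bypass_moves(program, live_out={"r7"}):
    print("%s -> %s" % move)
```

In this made-up sequence, both addition results reach the multiplier straight from `add.r`, and the writes to `r1` and `r2` disappear because neither register is read again or live outside the block.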
As the number of registers available in the architecture is limited, the compiler reuses the same registers to store different variables throughout the execution of the program. This leads to false dependencies during instruction scheduling, as the reordering of data transports can be limited not by a real data dependence, such as producer-consumer, but by a false dependence such as write-after-read or write-after-write. Removing uses of registers reduces this problem induced by register allocation.
Our first bypassing algorithm is based on a data dependence graph as a part of list scheduling, previously published in . In order to prevent possible deadlocks, the algorithm uses a conservative implementation, which first schedules the data transports of all operands before attempting to bypass operands directly from the result registers of the function units that produced the required values. A redundant result write to the register file can be removed only once all the uses of the value written to the register file have been bypassed and the value is not used outside the current basic block. As a heuristic for when not to bypass, we use the simple distance in cycles between the original write to the register, where the producer produces a value, and the cycle where the value is scheduled to be read from the register to the operand of the consumer (lookBackDistance). It is notable that this implementation works with top-down scheduling algorithms .
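The two conditions of the conservative approach can be written down as small predicates. This is a minimal sketch, not the published implementation: the function names are hypothetical, and the exact comparison against lookBackDistance (here, bypass only when the distance does not exceed the limit) is an assumption, since the text only states that the distance in cycles is used as the heuristic.

```python
def should_bypass(result_write_cycle, operand_read_cycle, look_back_distance):
    """Assumed heuristic: bypass only if producer and consumer are close enough."""
    return operand_read_cycle - result_write_cycle <= look_back_distance


def can_drop_result_write(uses_bypassed, used_outside_block):
    """A result write is dead only if every use was bypassed and the value
    is not needed outside the current basic block."""
    return all(uses_bypassed) and not used_outside_block


print(should_bypass(3, 5, look_back_distance=2))
print(can_drop_result_write([True, True], used_outside_block=False))
```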
Our second bypassing algorithm is also based on a data dependence graph. However, while in the first case we used a top-down scheduling algorithm, in this case, we reversed the direction and implemented bypassing during bottom-up scheduling. Due to the nature of bottom-up scheduling, we start the scheduling of operation by scheduling all result moves of operation. This has an advantage of immediate availability of information whether or not all of the uses of the result value become bypassed and an unnecessary write to the register file can be removed immediately.
In addition, our implementation starts the scheduling of result moves with an attempt to find all bypass destinations and create direct bypass moves (early bypassing). Only if some of the destinations cannot be bypassed, or the result value needs to be used outside the scope of the current basic block, is the result move to a register in the register file scheduled. Afterwards, bypassing is attempted once again for the destinations that did not get bypassed during early bypassing. While this late bypassing no longer improves the cycle count of the current operation, it still removes an unnecessary read from the register file and frees a register file read port.
Only once all of the result moves of the operation are scheduled, with or without bypasses, does the algorithm attempt to schedule the input operand moves. If the scheduling of operand moves fails, the result moves are unscheduled and a reschedule is attempted with only early bypassing enabled. If the scheduling of operands still fails, the result moves are unscheduled again and a reschedule is attempted with only late bypassing enabled. If the scheduling of operands fails once more, the scheduled moves are unscheduled and a reschedule is attempted without any bypassing. Only if all previous attempts fail is the starting cycle of the scheduling decreased and a reschedule attempted.
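The order of fallbacks described above can be summarized in a short sketch. It is reconstructed from the prose only and should not be read as the authors' algorithm: the three callbacks are hypothetical stand-ins for the real scheduler, the `min_cycle` bound is added so that the loop terminates, and moving on to the next bypass mode when the result moves themselves cannot be scheduled is an assumption.

```python
# Bypass modes in the order they are tried: (bypass_early, bypass_late).
BYPASS_ATTEMPTS = [(True, True), (True, False), (False, True), (False, False)]


def schedule_operation(op, cycle, schedule_result_reads, schedule_operand_writes,
                       unschedule_result_reads, min_cycle=0):
    """Bottom-up scheduling of one operation with progressively weaker bypassing.

    Returns (cycle, (bypass_early, bypass_late)) on success, or None on failure.
    """
    while cycle >= min_cycle:
        for early, late in BYPASS_ATTEMPTS:
            # Results first: bottom-up scheduling starts from the result moves.
            if schedule_result_reads(op, cycle, early, late):
                if schedule_operand_writes(op):
                    return cycle, (early, late)
                # Operands did not fit: undo results and any bypasses made.
                unschedule_result_reads(op)
        cycle -= 1  # every mode failed: retry from an earlier starting cycle
    return None
```

The callbacks would be supplied by the surrounding scheduler; with stubs that always succeed, the function returns the starting cycle with both early and late bypassing enabled.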
The outline of our scheduling algorithm is presented in Algorithm 1, with inputs denoting the set of the input operands of the operation being scheduled and with outputs denoting the possibly empty set of the results the operation produces. The ScheduleOperandWrites method simply tries to schedule input operands of the operation as late as possible once the results of the operation are successfully scheduled, taking into account the latency and pipeline characteristics of the operation on the selected function unit. The method UnscheduleResultReads simply unschedules all the previously scheduled results of the operation and undoes possible bypasses.
Algorithm 1 Scheduling moves of single operation in bottom-up fashion
Actual bypassing of result moves is presented in Algorithm 2, with cycle denoting the starting cycle from which the scheduling starts, outputs denoting the possibly empty set of results the operation being scheduled produces, and the two flags bypassEarly and bypassLate indicating if bypassing should be attempted early or late or both. The method ScheduleCandidateALAP tries to schedule the original result move to register as late as possible, starting from cycle, in case not all of the result reads were successfully bypassed or if the result is used in a different basic block.
Algorithm 2 Schedule and bypass result reads
Other bypassing strategies are possible, including pre-register allocation bypassing , recursively bypassing chain of operations on critical path, bypassing after the block is fully scheduled without changing schedule to reduce only register file accesses, etc.
4.2 Simple connectivity reduction
With the freedom of design choices offered by transport-triggered architectures, the process of manually optimizing the connectivity of actual processor can become rather difficult. We start by manually selecting a register file configuration, as will be described in Section 5, and fully connected interconnection network.
We used simple connectivity reduction. The idea behind the algorithm is to schedule an application for a fully connected TTA and then simply remove the connections of function units and register files to the buses that were never used in the existing schedule. In addition, whole function units and their respective sockets can be removed if they are unused by the application. The reduction in the number of socket-to-bus connections should lead to fewer bits being required to encode the source and destination fields of data transports for buses and possibly allow a higher clock frequency to be achieved.
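The idea can be sketched with sets. This is a simplified illustration rather than the tool flow used in the experiments: the representation of a connection as a `(socket, bus)` pair, of a scheduled transport as a `(bus, source_socket, destination_socket)` triple, the `unit.port` socket naming, and the function name are all assumptions made for the example.

```python
def reduce_connectivity(connections, schedule):
    """Keep only socket-to-bus connections that some scheduled transport used.

    Returns the kept connections and the units left without any connection.
    """
    used = set()
    for bus, src_socket, dst_socket in schedule:
        used.add((src_socket, bus))
        used.add((dst_socket, bus))
    kept = connections & used
    all_units = {socket.split(".")[0] for socket, _bus in connections}
    kept_units = {socket.split(".")[0] for socket, _bus in kept}
    return kept, all_units - kept_units


# Hypothetical fully connected machine: 8 sockets on 2 buses.
sockets = ["rf.r", "rf.w", "add.o", "add.t", "add.r", "mul.o", "mul.t", "mul.r"]
full = {(s, b) for s in sockets for b in (0, 1)}
# Hypothetical schedule of one addition.
schedule = [(0, "rf.r", "add.o"), (1, "rf.r", "add.t"), (0, "add.r", "rf.w")]

kept, unused_units = reduce_connectivity(full, schedule)
print(len(full), "->", len(kept), "connections; unused units:", unused_units)
```

Because the result depends entirely on one schedule, recompiling a modified application for the reduced machine may need connections that are no longer there, which is the reprogrammability cost discussed in the introduction.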
5 Experimental setup
Integer subset of CHStone benchmark used in our experiments
CHStone/The Portable Video Research Group
Architectures with a single multi-ported (SM) register file
SM 1×4×4 - 1 RF - 4 read 4 write ports (42 registers; Figure 2a)
Architectures with a single register file with a single read and single write port (SS) or multiple register files with a single read and single write port (MS)
SS 1×1×1 - 1 RF - 1 read 1 write port (42 registers; Figure 3a)
Architectures with multiple register files with multiple read and write ports (MM)
MM 2×2×1 - 2 RFs - 2 read 1 write port in each (2×21 registers; Figure 4a)
MM 2×2×2 - 2 RFs - 2 read 2 write port in each (2×21 registers; Figure 4b)
For each data input port, the register file contains an input opcode, which specifies the register to be written in the register bank, and an input trigger, which specifies when the data are to be written to the register given by the corresponding opcode.
For each data output port, the register file contains an output opcode, which describes which register is fed to the output port.
When using a register file with single input and output ports, the complexity of the write and read control is transferred to the interconnection network and, being visible in the data path, can be optimized more effectively by compiler techniques and in the design space exploration. Usually, the register file tends to be the end point of the critical path of the processor. Increasing the input control capacitance and delay by adding a write port has an effect not only on the input control logic of the register file but also on the interconnection network. The addition of a read port to the register file has only minor effects on the capacitance and the delay of the control logic.
The remaining parts of our processor stay the same, with three integer ALUs, one multiplier, a load-store unit, and eighteen buses to accommodate the ILP available across our set of benchmarks.
We schedule our selected benchmarks for the set of preselected TTA designs four times. First, we use our previously published top-down scheduling algorithm , with actual software bypassing disabled, and collect resulting data, such as the number of clock cycles the benchmarked application needs to end and the number of reads and writes of the registers in all available register files. We also collect information about instruction width and the number of socket-to-bus connections.
Afterwards, we enable software bypassing with a top-down scheduler and recompile our set of benchmark applications for all the selected architectures again, collecting the same data as above.
Having collected information from the more conservative software bypassing implementation, we repeat the steps above using our new early software bypassing during the bottom-up scheduling algorithm, presented in Section 4.1. We collect data without software bypassing enabled and again with software bypassing enabled. This allows us to consider the differences between scheduling and bypassing strategies.
For the second test, for each combination of benchmark and architecture, we remove unused connections. This has no effect on actual cycle counts or on the number of register file reads and writes, but it reduces the number of socket-to-bus connections and in some cases narrows the instruction word as the number of bits required to address all sockets connected to any bus can drop. The application of bypassing of course changes the schedule, so in some cases, the number of removed connections can be higher for the case without bypassing and vice versa.
Taking those eight sets of data, we synthesize each architecture to 130-nm CMOS standard cell ASIC technology with Synopsys Design Compiler and run a gate-level simulation with Mentor ModelSim. From the results of the gate-level simulation, we acquire gate activity for Synopsys Power Compiler. From Synopsys Power Compiler, we acquire the power used by individual architectural components of the processor core, such as the interconnection network, individual register files, function units, instruction fetch, and instruction decode.
The processors were synthesized to a 250-MHz clock frequency (4-ns clock period) since for this value, architectures with a larger number of read and write ports can still be synthesized and meet timing constraints.
In addition, after collecting the results from the experiments as described above, we select the architecture configuration that showed the best energy efficiency with software bypassing across our set of benchmarks and take the benchmark's cycle counts as a measure of real-time performance at a 250-MHz frequency. Connectivity reduction does not improve clock cycle performance; therefore, to achieve the same real-time performance, we compute the required frequency as follows: Required frequency = 250 MHz × (Reduced connectivity cycles / Bypassed cycles). We then attempt to synthesize each of the benchmarks for its required frequency, run gate-level simulation, and collect power data, as described above.
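As a worked illustration of the required-frequency formula, the following sketch computes the clock a connectivity-reduced processor would need to finish in the same wall-clock time as the bypassed one. The function name is hypothetical and the cycle counts are made up for the example; they are not measured results.

```python
def required_frequency_mhz(reduced_connectivity_cycles, bypassed_cycles,
                           base_mhz=250.0):
    """Frequency at which the slower schedule matches the bypassed run time."""
    return base_mhz * reduced_connectivity_cycles / bypassed_cycles


# Made-up cycle counts: bypassing saves one sixth of the cycles.
f_mhz = required_frequency_mhz(1_200_000, 1_000_000)
print(f"required frequency: {f_mhz:.1f} MHz, "
      f"clock period: {1000.0 / f_mhz:.2f} ns")
```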
Results of our experiments are discussed in detail in Section 6.
In the following, we first present results collected during setting up our experiment in Subsection 6.1. Afterwards, we discuss energy results obtained by synthesis and simulation in Subsection 6.2. In addition, we discuss results collected when trying to match real-time performance obtained by the use of software bypassing via synthesizing for higher frequency in Subsection 6.3.
6.1 Data collected before synthesis and simulation
In Figure 6c, we can see the reduction in instruction width when connectivity reduction is applied, and Figure 6d shows the number of socket-to-bus connections left after connectivity reduction. It can be seen from Figure 6c,d that there is variation when applying connectivity reduction for cases without bypassing and with bypassing. Since bypassing causes changes to the schedule, there are added direct data transports between function units and the schedule is more compact, leading to more activity per cycle. On the other hand, the number of data transports between function units and the register file decreases with software bypassing; therefore, connectivity to the register file can be reduced. It can be seen from those two figures that once again bottom-up scheduling and early bypassing lead to a greater reduction in instruction width and a lower number of remaining connections; the variation is, however, only about 5%.
Overall, Figure 6 shows that the connectivity reduction indeed leads to reduction in instruction width and successfully removes a large number of socket-to-bus connections and that the software bypassing produces large reduction in dynamic register reads and writes as well as large drop in cycle counts, with eventually simpler architectures with a single read and single write port in the register file outperforming much larger architectures with multi-ported register files without bypassing. The combination of these reductions has an effect on the power of individual processor components and results in energy reduction.
However, we can also observe that the effect of software bypassing on the removal of connections is clearly limited. There is typically at most 5% variation in the number of connections removed with and without software bypassing, and a similarly small variation in instruction width reduction. A larger variation is visible between top-down scheduling with conservative bypassing and bottom-up scheduling with early bypassing. We observe that while the older top-down scheduler provides a better starting point in terms of clock cycles than the new bottom-up scheduler implementation, the difference narrows when bypassing is enabled, and overall, in terms of register file accesses, connectivity removal, and instruction width, the novel algorithm performs better.
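The idea behind connectivity reduction, keeping only the socket-to-bus connections that the compiled application uses, can be sketched in a few lines of Python. The data model below (connections as `(socket, bus)` pairs, moves as `(source_socket, destination_socket, bus)` triples, and the function names) is a hypothetical simplification and does not reflect the representation used by our tools.

```python
from itertools import product

def fully_connected(sockets, buses):
    """All socket-to-bus connections of a fully connected interconnect."""
    return set(product(sockets, buses))

def used_connections(schedule):
    """Collect the socket-to-bus connections exercised by a schedule.

    `schedule` is a list of instructions; each instruction is a list of
    moves given as (source_socket, destination_socket, bus).
    """
    used = set()
    for instruction in schedule:
        for src, dst, bus in instruction:
            used.add((src, bus))
            used.add((dst, bus))
    return used

def reduce_connectivity(connections, schedule):
    """Keep only the connections that the compiled program actually uses."""
    return connections & used_connections(schedule)

# Hypothetical machine and schedule, for illustration only.
sockets = ["rf.read", "rf.write", "alu.in1", "alu.in2", "alu.out"]
buses = ["b0", "b1"]
schedule = [
    [("rf.read", "alu.in1", "b0")],
    [("rf.read", "alu.in2", "b0")],
    [("alu.out", "rf.write", "b1")],
]

full = fully_connected(sockets, buses)
reduced = reduce_connectivity(full, schedule)
print(len(full), "connections before,", len(reduced), "after reduction")
```

Because the reduced interconnect only supports the moves found in this particular schedule, a different program may need connections that were removed, which is why connectivity reduction trades away reprogrammability.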
6.2 Results of synthesis and simulation
After performing gate-level simulation on all our benchmarks, architectures, and optimization combinations as described in Section 5, we collected power data for individual processor components. We computed the energy of individual processor components using the common formula Energy = Power × Cycles / Frequency. In addition to the individual benchmarks, we computed averages over all benchmarked applications to focus on overall trends.
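As a worked illustration of the formula above, the following Python sketch computes per-component and total energy. The power values and the cycle count are made-up placeholders, not measurements from our experiments; only the 250 MHz clock frequency corresponds to a synthesis target used in this paper.

```python
def component_energy_joules(power_watts, cycles, frequency_hz):
    """Energy = Power x Cycles / Frequency, with power in W and frequency in Hz."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return power_watts * cycles / frequency_hz

# Placeholder average power per component in watts (assumed values).
component_power_watts = {
    "register_file": 0.002,
    "interconnect": 0.003,
    "function_units": 0.005,
}
cycles = 1_000_000       # assumed cycle count of one benchmark run
frequency_hz = 250e6     # 250 MHz clock

energy_joules = {
    name: component_energy_joules(power, cycles, frequency_hz)
    for name, power in component_power_watts.items()
}
total_joules = sum(energy_joules.values())

for name, energy in energy_joules.items():
    print(f"{name}: {energy * 1e6:.2f} uJ")
print(f"total: {total_joules * 1e6:.2f} uJ")
```

The formula makes the two sources of savings explicit: connectivity reduction mainly lowers the power term of the affected components, whereas software bypassing also lowers the cycle count, which scales the energy of every component.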
Overall, results in Figure 11 show that the combination of software bypassing and connectivity reduction leads to energy savings up to 50% using single-ported register files compared to the energy required by the largest architecture with four read and four write ports while maintaining or improving cycle counts.
While in Figure 6a,b we observed fairly consistent clock cycle performance across the range of architectures after the application of software bypassing, as well as a reduction in register file reads and writes, Figure 13a clearly shows how expensive more complex register file configurations are, even with software bypassing.
Figure 13b shows the effect of bypassing and connectivity reduction on the interconnection network. Both reduced connectivity and bypassing result in a drop in energy, with the combination of the two providing the best results. However, while connectivity reduction lowers energy by actually removing components that consume energy, the benefit from software bypassing is largely due to a decrease in cycle counts and interconnection network traffic. The new bottom-up scheduling and bypassing algorithm gives better interconnection energy results for all of the architectures.
6.3 Matching real-time performance via synthesizing for higher clock frequency
We observed that several architectures had similarly low energy requirements when bypassing was used. From these, we selected an architecture with two register files, each with a single read and a single write port. We then attempted to achieve the same real-time performance without bypassing, applying only connectivity reduction and synthesizing for a higher clock frequency.
Clock cycles and timing constraints to match real-time deadline for MS 2 × 1 × 1 architecture with bottom-up scheduling
Results in Figure 21 indicate that the increase in clock frequency required to achieve a shorter execution time to match performance with software bypassing leads to an increase in energy consumption for four of the eight benchmarks, when compared to the fully connected architecture with bypassing synthesized for 250 MHz.
Exceptions to this trend are the aes, jpeg, mips, and motion benchmarks. As can be seen from Table 2, the impact of software bypassing on these benchmarks was relatively limited, and therefore, only a relatively small increase in clock frequency was required.
Compared to the combination of software bypassing and reduced connectivity at 250 MHz, however, synthesizing for higher frequency leads to an increase in energy requirements for all the cases.
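The clock frequency needed to match a real-time deadline follows directly from the cycle counts. The short Python sketch below shows the arithmetic; the cycle counts are hypothetical placeholders rather than values from Table 2, and the function name `required_frequency_hz` is introduced here only for illustration.

```python
def required_frequency_hz(cycles_without_bypass, cycles_with_bypass,
                          base_frequency_hz):
    """Frequency at which the non-bypassed program meets the same deadline
    as the bypassed program running at `base_frequency_hz`."""
    if min(cycles_without_bypass, cycles_with_bypass) <= 0:
        raise ValueError("cycle counts must be positive")
    if base_frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    deadline_s = cycles_with_bypass / base_frequency_hz
    return cycles_without_bypass / deadline_s

# Hypothetical cycle counts for one benchmark (assumed values).
cycles_without_bypass = 1_200_000
cycles_with_bypass = 1_000_000
base_frequency_hz = 250e6   # 250 MHz, the baseline used in this section

target_hz = required_frequency_hz(cycles_without_bypass, cycles_with_bypass,
                                  base_frequency_hz)
increase = target_hz / base_frequency_hz - 1.0
print(f"required frequency: {target_hz / 1e6:.1f} MHz "
      f"({increase * 100:.0f}% above baseline)")
```

At the matched frequency, both designs finish in the same time, so any difference in energy comes from the power drawn when the design is synthesized for the tighter timing constraint. A benchmark whose cycle count changes little with bypassing needs only a small frequency increase, which is consistent with the exceptions noted above.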
In this paper, we evaluated software bypassing, our proposed method for improving the energy efficiency of processor cores with exposed data path architectures, against connectivity reduction, a design space exploration technique. Our observations show that both the compiler optimization of software bypassing and the architecture optimization of connectivity reduction decrease the energy requirements of the processor.
In particular, we observed that for the architecture with several register files, connectivity reduction brings overall benefits equal to those of software bypassing, mainly due to the large number of removed connections and the consequent decrease in instruction width. It has, however, no significant effect on the register file energy and does not contribute to a performance increase as such. In the case of a single register file, the overall effect of connectivity reduction on energy savings is much smaller than that of software bypassing. It is notable, however, that connectivity reduction allowed for a higher clock frequency in architectures with a small number of register file write ports.
Software bypassing, on the other hand, showed fairly consistent improvements in cycle counts across all tested architectures. Eventually, even the most limited register file configuration with bypassing achieved better clock cycle performance than the architectures with a large number of read and write ports without bypassing.
Software bypassing does not reduce instruction width or the complexity of the interconnection network; its main benefits come from cycle count improvements, with the associated energy savings across all components, and from register file savings.
We observed that software bypassing provides energy efficiency similar to or better than that of processor customization by connectivity reduction while maintaining the full programmability of the processor.
We also showed that, in order to match the performance achievable with software bypassing, architectures with reduced connectivity can be synthesized for a higher frequency. However, for four of the eight benchmarks, this increases energy requirements compared to software bypassing with a fully connected network, and it also sacrifices reprogrammability.
In addition, the combination of software bypassing and reduced connectivity is more energy efficient than synthesizing for higher frequency for all the benchmarks. Therefore, if the reprogrammability is not an issue, it is still more effective to combine software bypassing with reduced connectivity, combining their respective benefits, than to use only reduced connectivity and synthesize for higher frequency.
Part of the work presented in this paper has been financially supported by the Academy of Finland (funding decision 253087) and Radio Laboratory of Nokia Research Center.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.