SmartCell: An Energy Efficient Coarse-Grained Reconfigurable Architecture for Stream-Based Applications
© C. Liang and X. Huang. 2009
Received: 2 February 2009
Accepted: 15 April 2009
Published: 30 June 2009
This paper presents SmartCell, a novel coarse-grained reconfigurable architecture that tiles a large number of processor elements with reconfigurable interconnection fabrics on a single chip. SmartCell provides high-performance, energy-efficient processing for stream-based applications. It can be configured to operate in various modes, such as SIMD, MIMD, and systolic array. This paper describes the SmartCell architecture design, including the processing element, reconfigurable interconnection fabrics, instruction and control process, and configuration scheme. A SmartCell prototype with 64 PEs is implemented using 0.13 μm CMOS standard cell technology. The core area is about 8.5 mm², and the power consumption is about 1.6 mW/MHz. The performance is evaluated through a set of benchmark applications and compared with FPGA, ASIC, and two well-known reconfigurable architectures, RaPiD and Montium. The results show that SmartCell can bridge the performance and flexibility gap between ASIC and FPGA. It is also about 8% and 69% more energy efficient than the Montium and RaPiD systems for the evaluated benchmarks, and achieves about 4 and 2 times higher throughput than Montium and RaPiD, respectively. It is concluded that the SmartCell system is a promising reconfigurable and energy-efficient architecture for stream processing.
Nowadays, stream-based applications, such as multimedia, telecommunications, signal processing, and data encryption, are the dominant workloads in many electronic systems. The real-time constraints of these applications, especially on portable devices, often impose stringent energy and performance requirements. Many military applications, including real-time synthetic aperture radar imaging, automatic target recognition, surveillance video processing, optical inspection, and cognitive radio systems, have similar needs. General purpose processors (GPPs) are widely used in conventional data-path oriented applications due to their flexibility and ease of use. However, their sequential software execution cannot meet the increasing performance, cost, and energy requirements of the data streaming application domain. Application-specific integrated circuits (ASICs) have become the customary solution to these ever-increasing demands for highly repetitive parallel computations. It is reported that they are potentially two to three orders of magnitude more efficient than processors in terms of the combined performance of computational power, energy consumption, and cost. Although ASICs provide the best performance for specific applications, they are not desirable for all designs. ASICs generally have a fixed data flow with predefined functionality, which makes them unable to accommodate new system requirements or changes in standards. The long design cycle and high nonrecurring engineering (NRE) cost also become obstacles to meeting stringent cost and time-to-market requirements.
Reconfigurable architectures (RAs) have long been proposed as a way to balance the flexibility of GPPs with the performance of ASICs. A hardware-based RA implementation can exploit the spatial parallelism of the computing tasks in targeted applications while avoiding the instruction fetching, decoding, and execution overhead of software implementations, which results in an energy and performance gain over general purpose processors. On the other hand, RAs retain post-fabrication flexibility: they can be configured, either offline or on the fly, to accommodate new system requirements or protocol updates, which is not feasible in ASIC implementations. The flexibility provided by RAs can also improve fault tolerance and system reliability: design bugs can be fixed by loading new configurations, and malfunctioning circuitry can be isolated from the rest of the system to achieve recovery and prolong the product's lifetime.
Field-programmable gate arrays (FPGAs) are still the dominant semiconductor technology in the reconfigurable computing area. The most common SRAM-based FPGAs decompose complex logic functions into smaller ones and map them onto Lookup Tables (LUTs) or other on-chip embedded resources. The island-style routing fabrics can be configured to form the desired application datapath. The bit-level fine-grained granularity is suitable for implementing a large variety of functions directly on the FPGA's rich hardware resources. However, this flexibility comes at a significant cost in area, power consumption, and speed, due to the huge routing area overhead and timing penalty. Furthermore, because of the fine-grained nature, compiling for and configuring FPGAs takes much longer than for general purpose processors.
In recognition of these issues, several research projects have developed coarse-grained reconfigurable architectures (CGRAs), including [2–7]. Benefiting from much lower routing overhead, these CGRAs have the potential to improve on the power efficiency of fine-grained FPGAs.
This paper presents SmartCell, a novel CGRA targeted at high-performance, low-energy reconfigurable systems. SmartCell integrates a large number of tiny processor cores (cells) onto a single chip. The cells are interconnected by three levels of programmable switching fabrics: intracell connection, nearest neighbor connection, and a modified concentrated mesh (CMesh) on-chip network. Dynamic reconfigurability is achieved in two modes: coarse-grain cell broadcasting and fine-grain ID-based configuration. The number of processor elements involved in the computing tasks can also be changed dynamically to meet application requirements: more cells can be involved to achieve high computational performance, while fewer cells are kept active when power constraints are more stringent. SmartCell can be configured to operate in various computing styles, such as SIMD, MIMD, and systolic array, targeting applications with inherent data parallelism and high computing and communication regularity.
The rest of this paper is organized as follows: after describing the background and examining some existing reconfigurable architectures in Section 2, the SmartCell architecture is detailed in Section 3, including the design of the computational units, layered interconnection fabrics, and control/configuration schemes. Section 4 presents a case study of mapping matrix multiplication onto SmartCell and analyzes its performance. Section 5 presents the implementation of a prototype SmartCell system and its performance comparison with other computing systems, followed by the conclusions in Section 6.
2. Background and Related Work
In recent years, the state of computing has evolved from a single CPU into a network of on-chip multiprocessors running in parallel at relatively low frequencies. The major driving force behind this shift is the need for high computational capacity combined with high energy efficiency. This has become critically important for many DSP applications, especially in portable devices. In this section, we briefly summarize several computing architectures targeted at data streaming applications.
Traditional FPGAs use static configuration streams to control the functional and routing resources according to user specifications. Data parallelism and flexible on-chip communications are essential to meet the high-performance requirements of the computations being performed. Due to the direct mapping of application tasks onto hardware resources, an FPGA can complete one operation in a single clock cycle, avoiding the instruction fetching, decoding, and execution overheads of software processors.
However, as mentioned previously, the fine-grained granularity and general purpose LUTs in FPGAs incur high routing overheads and intensive control requirements, which degrade performance and make compilation and configuration very slow. Recognizing these problems, commercial FPGA vendors have introduced more coarse-grained components as computing primitives in their newer FPGAs. The Virtex-4 and Virtex-5 series, among the latest Xilinx FPGAs, mix basic logic cells with coarse-grained DSP slices to enhance signal processing capacity and power performance. A similar idea can be found in Altera's Stratix II FPGAs. In addition, 6-input and 8-input LUTs are introduced in the Virtex-5 and Stratix II FPGAs, respectively, as substitutes for the traditional 4-input LUT, which helps reduce routing overhead and ease the configuration process. But SRAM-based FPGAs have fundamental limits that hamper them from becoming the mainstream computing medium for data streaming applications. The configuration SRAM cells are costly in power and area and must hold their state during the entire operation. One study found that FPGAs consume about 14 times more dynamic power and are about 35 times larger than equivalent ASICs on average when only logic elements are used. Furthermore, FPGAs do not support instruction sequencing and are thus infeasible or very costly to change on the fly. Unlike FPGAs, the SmartCell system achieves online flexibility by simply pointing to a new instruction code. The coarse-grained nature of SmartCell avoids the low-level compilation and high routing overhead involved in FPGA designs. The hardware circuitry for domain-specific operations also results in smaller area and better energy efficiency than FPGAs.
A number of research efforts have explored efficient CGRA designs. The RAW system incorporates 16 simplified 32-bit MIPS processors in a 2D mesh structure to provide highly parallel computing capacity. The RaPiD architecture links a large number of heterogeneous reconfigurable components, including ALUs, multipliers, and RAMs, in a 1D array structure. The potential applications for RaPiD are those of a linear systolic nature or those that can be easily pipelined among the computational units. The PipeRench system offers several reconfigurable pipeline stripes as an accelerator for data streaming applications; its configurable interconnection fabrics are limited to a local network inside a stripe, unidirectional nearest neighbor connections between stripes, and some global buses. The Matrix approach incorporates a large number of 8-bit Basic Functional Units (BFUs) in a 2D mesh structure; its routing fabrics provide three levels of 8-bit bus connections, which can be configured into SIMD, MIMD, or VLIW styles. In the Chess system, 4-bit ALUs are tiled in a hexagonal array with ample reconfigurable interconnect fabrics to build a chessboard-like floorplan. The configuration contexts of a complex function can be cast and forwarded among active processors, so the functionality of the ALUs can be changed on a cycle-by-cycle basis. MorphoSys is an integrated and configurable system-on-chip targeted at high-throughput, data-parallel applications: a modified RISC processor controls the reconfigurable accelerator arrays through an efficient memory interface. The MorphoSys system is best suited to SIMD-style operation due to its column-wise or row-wise configuration broadcasts.
Many other reconfigurable architectures have been implemented with various technologies [13–15]. Most of them focus on exploring computational models or on efficient design with respect to area and performance. Processor-In-Memory- (PIM-) based systems [16, 17] integrate processing logic and memory onto the same chip and perform computations directly in memory, which greatly reduces the data transfer overhead between the CPU and main memory. Power consumption is another important aspect of reconfigurable architecture design. In [18, 19], power efficient architectures are developed for specific applications and compared with fine-grained implementations. However, these are not generic coarse-grained architectures but application-specific models for data streaming.
More recently, other CGRA systems with limited computing resources, such as ADRES and Montium, have been developed to provide ultralow power consumption. The SmartCell system integrates several of the prominent features of these previous systems. The 16-bit granularity of the basic operations is efficient for exploiting data parallelism while keeping the cost of computing and communication low. SmartCell can also be configured to operate in SIMD, MIMD, and systolic array styles thanks to its distributed configuration contexts and rich on-chip connections. Dynamic configuration can be performed in both fine- and coarse-grain modes. The uniform delay of the hierarchical interconnections also eases the scheduling of stream processing among multiple cell units. The combination of these features makes the SmartCell system a unique approach in the CGRA family. In Section 5, a direct comparison of SmartCell's energy efficiency and system throughput against RaPiD and Montium is presented.
3. SmartCell Architecture
3.1. Key Features
In a typical SmartCell architecture, a set of cell units is organized in a tiled structure. Each cell consists of four processing elements (PEs) along with control and data switching fabrics. A three-level layered interconnection network handles intra- and intercell communications. A serial peripheral interface (SPI) provides an efficient way to load or reconfigure instruction codes in active cell units. By reconfiguring the instruction memories, the data flow can be changed dynamically to accommodate different application demands. The most important features of the SmartCell architecture are summarized as follows.
(i)Coarse-grained granularity: SmartCell is a coarse-grained configurable system targeted at computation intensive applications. The processing elements operate on 16-bit input signals and generate a 36-bit output signal, which avoids the high overhead of fine-grained architectures and ensures better performance.
(ii)Flexibility: due to the rich computing and communication resources, versatile computing styles can be mapped onto the SmartCell architecture, including SIMD, MIMD, and 1D or 2D systolic array structures. This also expands the range of applications that can be implemented.
(iii)Dynamic reconfiguration: by loading new instruction codes into the configuration memory through the SPI structure, new operations can be executed on the desired PEs without interrupting the others. The number of PEs involved in an application is also adjustable for different system requirements.
(iv)Fault tolerance: fault tolerance is an important feature for improving production yields and extending the device's lifetime. In the SmartCell system, defective cells, whether caused by manufacturing faults or circuit malfunctions, can easily be turned off and isolated from the functional ones.
(v)Deep pipelining and parallelism: two levels of pipelining are achieved: instruction level pipelining (ILP) within a single processor element and task level pipelining (TLP) among multiple cells. Data parallelism can also be exploited to execute multiple data streams concurrently, which together ensure a high computing capacity.
(vi)Hardware virtualization: in our design, distributed context memories store the configuration signals for each PE. The cycle-by-cycle instruction execution supports hardware virtualization, making it possible to map large applications onto limited computing resources.
(vii)Explicit synchronization: a program counter (PC) schedules the instruction execution time for each PE on the fly. Variable delays are also available for input/output signals inside each PE. Therefore, SmartCell provides explicit synchronization that eases the exploitation of computing parallelism.
(viii)Unique system topology: the cell units are tiled in a 2D mesh structure with four PEs inside each cell. This topology provides varying computing densities to meet different computational requirements. With the help of the hierarchical on-chip connections, the SmartCell architecture can be dynamically reconfigured to perform in various operational styles.
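The 16-bit-in, 36-bit-out sizing in feature (i) is consistent with a common MAC accumulator rule: a 16x16 multiply yields a 32-bit product, and the extra bits act as guard bits against overflow during accumulation. The sketch below illustrates this rule; the function name and the choice of 16 accumulations are our assumptions, not taken from the paper.

```python
import math

def mac_bits_needed(operand_bits: int, num_accumulations: int) -> int:
    """Bits required to accumulate a run of full-range products without overflow."""
    product_bits = 2 * operand_bits                       # 16 x 16 -> 32-bit product
    guard_bits = math.ceil(math.log2(num_accumulations))  # headroom for the sum
    return product_bits + guard_bits

# 4 guard bits allow up to 2**4 = 16 accumulations within 36 bits.
assert mac_bits_needed(16, 16) == 36
```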
3.2. Cell Unit and Processing Element
Basic operations supported by SmartCell processors:
(i) arithmetic: add, sub, mult, MAC, abs sum;
(ii) logic: and, or, not, xor, nand, compare, etc.;
(iii) shift: shift right, shift left, circular shift.
A pipeline of up to four stages is implemented in each processor, as denoted by different colors in Figure 2. The Src select stage takes input data from the on-chip connections, calculated by other PEs or by itself, and stores the data in its local register banks. The execution stages (Exe1 and Exe2) occupy two clock cycles for basic multiply-add and other logic operations. The Des select stage selects the output result and sends it back to the on-chip interconnections. Unlike a traditional pipelined processor design, the pipeline stages are not fixed in SmartCell. A bypass path can be selected in every stage except Src select, allowing input data or intermediate results to pass straight through to the next operating unit to reduce processing delay when required. The traditional decoding stage is replaced by an instruction controller, which generates all control and scheduling signals in parallel with the four pipeline stages. An instruction code, prestored in the instruction memory, is loaded into the instruction controller on a cycle-by-cycle basis to provide both functionality and datapath control for a specific algorithm. In summary, the flexible 4-stage pipeline structure avoids the deep instruction pipelines of fetching, decoding, register read/write, and ALU operations in conventional general purpose processors. Additionally, the instruction code can be dynamically reconfigured in various modes to adapt to different application requirements. Therefore, SmartCell is able to provide energy efficiency comparable to an ASIC while maintaining the dynamic programmability of a DSP.
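The flexible pipeline described above can be modeled as a stage list in which every stage except Src select may be bypassed, shortening the effective latency through a PE. The toy model below captures only this latency bookkeeping; the stage names follow the text, while the latency accounting is our simplification.

```python
# Stage names follow the paper; each non-bypassed stage costs one cycle.
STAGES = ["src_select", "exe1", "exe2", "des_select"]

def pipeline_latency(bypassed: set) -> int:
    """Clock cycles through one PE given the set of bypassed stage names."""
    assert "src_select" not in bypassed, "Src select cannot be bypassed"
    return sum(1 for stage in STAGES if stage not in bypassed)

assert pipeline_latency(set()) == 4               # full multiply-add path
assert pipeline_latency({"exe1", "exe2"}) == 2    # fast pass-through datapath
```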
Frame format of the instruction code: 64 bits per instruction.
3.3. Three-Level Layered On-Chip Interconnection
As CMOS technology scales down, interconnect has become an increasingly important issue in integrated circuit design. In many signal processing applications, the system throughput is significantly affected by communication costs, so efficient data exchange has become a key feature of high-performance systems. A shared bus with high bandwidth is usually adopted in modern multicore CPU designs, but its lack of scalability and high power consumption make it unfavorable for data streaming applications. Other on-chip switch topologies are available, such as the fully connected crossbar and island-style mesh networks. The crossbar network provides the flexibility to connect any components in the network with limited transfer delay; despite these advantages, it suffers from high silicon area cost, high power consumption, and low scalability. The island-style mesh network, often used in FPGAs, attaches each computing unit to its own switch fabric to transmit/receive data or to relay data to adjacent nodes. The mesh network offers a regular structure and is easy to scale, but it suffers from longer delays and complex control logic. Recognizing these trade-offs, a compromise hierarchical routing structure with three network levels is designed for SmartCell: a fully connected crossbar unit for intracell data exchange, a static nearest neighbor connection for intercell communications, and a reconfigurable modified CMesh network for concurrent data communication among nonadjacent cell units.
3.3.1. Fully Connected Crossbar Intracell Interconnection
Initially, a centralized shared register memory (SRM) block was designed for intracell communications, but it was abandoned due to its high area and power costs and complex memory access control. In the current design, the PEs and instruction memories are placed at the four edges of a cell, and a fully connected crossbar switch box provides a nonblocking data exchange connection. Compared to the SRM implementation, the control logic of the crossbar connection is substantially simpler, which results in better timing and area performance.
3.3.2. Static Nearest Neighbor Intercell Interconnection
In our system, the homogeneous cell units are tiled in a 2D mesh structure, so adjacent cells can be connected directly through short wires. Since the four PEs in a cell are placed at its four edges, each PE can be directly linked to the nearest PE in the adjacent cell, as shown in Figure 1. This static network supports single-cycle bidirectional transmission of two 16-bit signals and one 36-bit signal between connected PEs. These signals are aligned with the cell's internal signals to provide the PE's inputs; no extra synchronization or delay is involved. This low-latency, self-synchronizing feature is critical for exploiting task level parallelism among cell units in many multimedia and signal processing applications.
3.3.3. Reconfigurable Hierarchical CMesh Network
Dimension-Order Routing (DOR) is implemented to route data first in one dimension and then in the other for multihop data transmissions. Because no cyclic channel dependency can form, the DOR scheme guarantees deadlock-free routing. Instead of arranging the routers in a ring as in a traditional CMesh network, high-level routers that each connect four local routers are chained together to form another CMesh network, yielding a so-called hierarchical CMesh network. This hierarchy reduces the number of long wires compared with a traditional CMesh, and the same routing components can easily be added to or removed from the SmartCell architecture for scalability. In our design, each cell can receive a 36-bit signal through the CMesh network every clock cycle, which leads to a single-hop system throughput of 57.6 Gbits/s for a 4 by 4 SmartCell operating at 100 MHz.
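The dimension-order scheme above can be sketched as an XY route: travel fully in X, then in Y. The coordinate convention and unit-hop granularity below are illustrative, not taken from the SmartCell router design.

```python
def xy_route(src, dst):
    """Return the list of (x, y) hops from src to dst, X dimension first."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                       # resolve the X offset first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                       # then resolve the Y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# X-before-Y ordering is what rules out cyclic channel dependencies,
# making the routing deadlock-free.
assert xy_route((0, 0), (2, 1)) == [(0, 0), (1, 0), (2, 0), (2, 1)]
```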
3.4. Configuration and Control Flow
However, the configuration propagation latency differs among PE/cell units along the SPI chain: the nearer a unit is to the input port, the sooner its configuration completes. To compensate for this unbalanced configuration latency, a memory partitioning scheme is used in our design: new instruction codes are loaded into the unused context memories while the PEs are still operating on the current contexts. Once the new contexts are fully loaded, a global select signal indicates the switch of operation context. The configuration latency is hidden by this means, and multiple contexts can be swapped within one clock cycle.
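The memory partitioning scheme above is essentially double buffering: fill the inactive context partition in the background, then flip a select signal to switch in one cycle. The sketch below models that behavior; the class and method names are illustrative, not from the SmartCell design files.

```python
class ContextMemory:
    """Two-partition instruction memory with a one-cycle context switch."""

    def __init__(self):
        self.banks = [[], []]   # two context partitions
        self.active = 0         # models the global select signal

    def load_background(self, instructions):
        """Fill the unused partition while the active one keeps running."""
        self.banks[1 - self.active] = list(instructions)

    def swap(self):
        """One-cycle context switch: just flip the select signal."""
        self.active = 1 - self.active

    def current(self):
        return self.banks[self.active]

cm = ContextMemory()
cm.banks[0] = ["dct_step"]            # currently executing context
cm.load_background(["iir_step"])      # preload next context, no stall
assert cm.current() == ["dct_step"]
cm.swap()                             # single-cycle switch
assert cm.current() == ["iir_step"]
```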
4. A Case Study: Mapping of Matrix Multiplication onto SmartCell
A broad range of complex scientific and multimedia applications strongly depends on the performance of matrix-matrix multiplication. In this section, we use matrix multiplication as a case study to demonstrate how to map applications onto the SmartCell and to analyze its performance.
Various methods have been proposed in the literature for high-performance matrix multiplication, such as Cannon's algorithm, Strassen's algorithm, and, more recently, systolic algorithms using special systolic arrays. A subblock matrix multiplication scheme is adopted in our design: the operand matrices are partitioned into smaller submatrices, each of which is processed separately by different hardware resources in parallel, and the result matrix is generated as subblocks of a regular dense matrix. This scheme maps efficiently onto the SmartCell system, exploiting both spatial and temporal parallelism for high computing performance while achieving good data reuse among hardware resources to ease the external memory bandwidth requirement.
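The subblock scheme above corresponds to classic blocked (tiled) matrix multiplication, sketched here in plain Python: each b-by-b tile of C can be computed independently, mirroring how tiles would be distributed across cells. This is a reference model of the partitioning, not the hardware mapping itself.

```python
def blocked_matmul(A, B, n, b):
    """Multiply two n x n matrices (lists of lists) using b x b tiles."""
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, b):             # tile row of C (one cell's work)
        for bj in range(0, n, b):         # tile column of C
            for bk in range(0, n, b):     # accumulate over tiles of A and B
                for i in range(bi, bi + b):
                    for j in range(bj, bj + b):
                        for k in range(bk, bk + b):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert blocked_matmul(A, B, 2, 1) == [[19, 22], [43, 50]]
```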
To evaluate its performance, a 32 by 32 square matrix multiplication is mapped onto a 4 by 4 SmartCell system. Each final element requires 32 clock cycles to complete its 32 MAC operations. Due to the fully pipelined structure, a 4 by 4 subblock of the result can be calculated in a single cell within 128 clock cycles. The final 32 by 32 matrix C decomposes into 64 independent 4 by 4 subblocks, which the 16 available cells calculate in parallel. Thus a total of 512 clock cycles is needed to compute a single 32 by 32 matrix multiplication, yielding a system throughput of 195.3 KMatrices/s at 100 MHz.
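The cycle count above can be reproduced step by step; all figures (4 PEs per cell, 16 cells, 100 MHz) come from the text, and only the arithmetic is ours.

```python
n, block, pes_per_cell, cells, freq = 32, 4, 4, 16, 100e6

macs_per_element = n                                    # 32 MACs per element of C
cycles_per_subblock = block * block * macs_per_element // pes_per_cell
assert cycles_per_subblock == 128                       # one 4x4 tile per cell

subblocks = (n // block) ** 2                           # 64 independent tiles
rounds = subblocks // cells                             # 4 rounds over 16 cells
total_cycles = rounds * cycles_per_subblock
assert total_cycles == 512

throughput = freq / total_cycles                        # matrices per second
assert round(throughput / 1e3, 1) == 195.3              # 195.3 KMatrices/s
```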
5. Hardware Synthesis Results and Performance Comparisons
Application domain and test benches:
(i) 64-tap FIR, 32-tap IIR;
(ii) 64-point FFT, 8 by 8 2D-DCT;
(iii) image processing: 8 by 8 Motion Estimation (ME) in a 24 by 24 area;
(iv) 128 by 128 MMM, Polynomial Evaluation (PoE).
5.1. SmartCell Prototype
System design and simulation parameters: 4 by 4 cell array; ModelSim and Synopsys CAD tools; worst-case condition.
For power consumption and system throughput evaluations, all benchmarks are simulated at the same operating frequency of 100 MHz, the frequency RaPiD also used for its power consumption analysis. Because RaPiD was implemented in a larger process technology and operated at 3.3 V, a fair comparison requires scaling down its power consumption by a reasonable factor. In our study, full scaling is first performed to scale the power consumption down from 3.3 V to 1 V; by this means, the process dimension is also scaled down to 0.15 μm. To compensate for the effect of dimension scaling, constant voltage scaling is then performed to scale the power consumption up by a factor of 1.7. Overall, the RaPiD power consumption is scaled down by a factor of 9.34.
The same benchmarks are also implemented directly on an FPGA platform for comparison. Altera's state-of-the-art Stratix II FPGA in 90 nm process technology is selected as the benchmark platform; in particular, an EP2S30 device is used, since it is the smallest Stratix II FPGA that contains the same number of multipliers as the SmartCell system. The benchmarks are designed in Altera's Quartus II 6.1 CAD tool and simulated at 100 MHz in ModelSim. The PowerPlay Analyzer is used to evaluate the power consumption based on the switching annotation generated from the gate level simulations. For a fair comparison, only the core power consumption is recorded in the FPGA implementation, since I/O and auxiliary power are not included for the other platforms.
An ASIC implementation is also generated for each test bench using the same HDL code as in the FPGA designs; it is expected to provide the best power/energy performance at the cost of flexibility. We use the same 0.13 μm process technology as in SmartCell. Due to the large set of benchmarks under test, the standard cell circuits are generated automatically by the Synopsys CAD tools without custom optimizations. We estimate the power consumption of the ASICs from gate level simulations at 100 MHz and, as on the other platforms, record only the logic core power.
5.2. SmartCell Area, Timing, and Power Consumption Performance
The SmartCell can operate at a maximum frequency of about 123 MHz. Further investigation reveals that the single-cycle MAC unit inside the arithmetic component accounts for about 5.5 nanoseconds of the total critical path delay, with 3.2 nanoseconds for the 18-bit multiplier and 2.3 nanoseconds for the 36-bit adder. Custom optimizations, such as a pipelined multiplier or a carry lookahead adder, could improve this timing. Moreover, the 123 MHz maximum frequency reflects the worst-case critical path: an optional register between the MULT and ADD components can break the MAC operation into 2 cycles, so the critical path used by most benchmarks is shorter than that reported by the CAD tools. Configuration time is another important metric in reconfigurable architecture designs. In the SmartCell system, a full-chip fine-grain configuration completes within 13 microseconds at 100 MHz. Dynamic reconfiguration can be much faster, depending on how much the new configuration context differs; for example, 64 cycles are needed to reconfigure the SmartCell from 2D-DCT to the 64-tap IIR application. Furthermore, if both contexts have already been preloaded into the instruction memory, only one cycle is required to switch between the IIR and DCT applications using the memory partitioning scheme. For most applications under test, the SmartCell can be dynamically reconfigured in less than 1 microsecond, much faster than fine-grained FPGA reconfiguration.
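The reconfiguration latencies quoted above translate into cycle counts as follows; the figures are from the text, and the conversions are simple arithmetic.

```python
freq = 100e6                            # 100 MHz operating frequency
cycle_ns = 1e9 / freq                   # 10 ns per cycle

full_config_cycles = 13e-6 * freq       # full-chip fine-grain configuration
assert full_config_cycles == 1300

dct_to_iir_us = 64 * cycle_ns / 1e3     # ID-based partial reconfiguration
assert round(dct_to_iir_us, 2) == 0.64  # well under 1 microsecond

preloaded_cycles = 1                    # context already in instruction memory
```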
SmartCell power consumption and energy efficiency of different benchmarks at 100 MHz.
5.3. Comparison of Power/Energy Consumption with FPGA and ASICs
A more meaningful comparison is depicted in Figure 10(b), which shows the average energy efficiency (GOPS/W) of the evaluated platforms, normalized to the FPGA result. As expected, the ASICs are the most energy efficient, providing an average 16.4x energy efficiency gain over the FPGA; however, this gain comes at the cost of no post-fabrication flexibility and high engineering design cost. The energy performance of the SmartCell falls in between: it is about 4.1x more energy efficient than the FPGA and about 4x less efficient than the ASIC implementations. This result demonstrates that the coarse-grained architecture can fill the energy efficiency gap between fine-grained FPGAs and logic-specific ASICs.
5.4. Comparison with Other CGRA Systems
Power and energy comparison among the evaluated CGRA systems.
As listed in Table 7, five benchmarks have been mapped onto the SmartCell system, three of which are shared with RaPiD and Montium, respectively. Montium achieves the lowest power consumption, since it provides limited computing resources of only 5 ALUs. The cycle column in the table denotes the number of clock cycles needed to compute one data block, except for the FIR filter design: in the 20-tap FIR benchmark, 2 blocks of 512 samples are used to generate the cycle and energy figures. The results demonstrate that in most applications the SmartCell requires fewer clock cycles than the RaPiD and Montium implementations to finish the same amount of work. This benefit comes from the greater processing parallelism SmartCell provides: for example, in the SmartCell implementation, three data pipes can process the 20-tap FIR filter in parallel, whereas Montium adopts a recursive processing scheme, since it can handle at most a 5-tap FIR at a time, and this recursion involves extra control and data exchange overheads. Table 7 also compares the energy consumption, computed as the product of average power consumption and number of clock cycles.
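The energy metric used in Table 7 is average power times execution time (cycles divided by frequency). The helper below makes the unit handling explicit; the numeric example is a placeholder, not a value from the table.

```python
def energy_per_block(avg_power_mw: float, cycles: int, freq_hz: float) -> float:
    """Energy in nanojoules for one data block: P * t, with t = cycles / f."""
    return avg_power_mw * 1e-3 * (cycles / freq_hz) * 1e9

# Placeholder example: 160 mW for 512 cycles at 100 MHz.
assert round(energy_per_block(160, 512, 100e6), 1) == 819.2
```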
Figure 11(b) compares the normalized system throughput of the different platforms. The SmartCell and RaPiD provide the same throughput in the motion estimation application, due to similar algorithm mapping structures. SmartCell outperforms both Montium and RaPiD in all other benchmarks in terms of system throughput: in the FIR application, the SmartCell is about 6x faster than the Montium system, and it shows a maximum throughput gain of 4.2x over the RaPiD system in the matrix multiplication implementations. On average, SmartCell provides about 4.0x and 2.2x throughput gains over the Montium and RaPiD systems, respectively.
5.5. Development of Software Environment
Two phases are included in this design flow: an application and architecture analysis phase (Phase I) and an application mapping phase (Phase II). During Phase I, a high-level application description file (preferably in C) is input into the Smart_C environment. An application abstraction step parses the input file and extracts the workloads from it: all candidate loops are broken into linear sequences, and the data dependencies among them are analyzed. In parallel, a hardware description and system requirement file are input to generate the hardware abstract, which specifies the computing resources and I/O models for all available PEs. Finally, parallelism/pipeline exploration and application partitioning are performed to create scheduling code based on the hardware and software abstracts; the communication flow among active PEs is also scheduled here. The second phase, application mapping, transforms the scheduling code into configuration contexts that can be loaded directly into the instruction memories. First, the control signals are determined for every computing and communication component to form the desired application datapath. Two modes are provided: offline and online configuration. In offline configuration, the context file is downloaded directly into the instruction memories of all active PEs. In online configuration, the differences between the current context and the newly generated one are examined, and only the PEs whose contexts differ need to be updated.
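The online configuration step amounts to a set difference over per-PE contexts: only PEs whose stored context disagrees with the newly generated one are rewritten. A minimal sketch, assuming contexts are representable as per-PE words (the data layout and function name are ours, not from the paper):

```python
def contexts_to_update(current, new):
    """Return the PE ids whose configuration context changed.

    current, new : dicts mapping PE id -> context word.
    During an online (partial) reconfiguration, only these PEs
    need their instruction memories rewritten.
    """
    return sorted(pe for pe, ctx in new.items() if current.get(pe) != ctx)

current = {0: 0xA1, 1: 0xB2, 2: 0xC3}
new     = {0: 0xA1, 1: 0xB9, 2: 0xC3, 3: 0xD4}
print(contexts_to_update(current, new))  # -> [1, 3]
```

PE 1's context changed and PE 3 is newly activated, so only those two are reloaded; PEs 0 and 2 keep their existing contexts untouched.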
At the current stage, the application mapping environment (Phase II) has been implemented to generate the configuration contexts from an input assembly code. Owing to the regular system structure and data flow patterns of the targeted application domain, the cell-level configuration can be generalized to use the same prototype models. We therefore create two sets of computation and communication libraries, which specify all available computing operations and I/O models for a cell unit. Given an application, the designer is responsible for partitioning the kernel operations onto the PE components and for exploring the data flows among them. An application description code is then written in terms of the computing and I/O models specified in the assembly libraries. From these models, the context generator automatically decodes the control signals for the computing components from the input assembly code; the interconnection context is likewise generated from the input I/O models and the system architecture configuration. In this way, the configuration overhead can be greatly reduced. The remaining steps are the same as described earlier. The development of the system analysis phase (Phase I) requires extensive experience in system- and task-level profiling, analysis, and optimization, as is typical of complex compiler designs. Thanks to its regular tile structure and uniform control logic, SmartCell can be configured for different system requirements, from high performance to ultralow power consumption. The compiler must be robust enough to exploit this hardware flexibility. One possible approach is for the compiler to read a system constraint file specifying the system requirements, available hardware resources, target frequency, and so forth, and to generate configuration contexts that satisfy those requirements.
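The context generator's library-driven decoding can be pictured as a table lookup that packs operation and I/O fields into a context word. The sketch below is purely illustrative: the paper does not publish SmartCell's actual encodings, so the library entries, field widths, and assembly syntax here are invented stand-ins.

```python
# Hypothetical mini-libraries; real SmartCell operation and I/O
# encodings are not given in the paper.
COMPUTE_LIB = {"mac": 0b001, "add": 0b010, "mul": 0b011}
IO_LIB      = {"north": 0b00, "east": 0b01, "ring": 0b10}

def decode_line(asm_line):
    """Decode one assembly line of the form 'op io_model' into a
    context word: a 3-bit opcode field packed above a 2-bit
    interconnect field."""
    op, io = asm_line.split()
    return (COMPUTE_LIB[op] << 2) | IO_LIB[io]

program = ["mac east", "add ring"]
contexts = [decode_line(line) for line in program]
print([bin(c) for c in contexts])  # -> ['0b101', '0b1010']
```

Because every cell shares the same prototype model, one decoder and one pair of libraries suffice for the whole array, which is where the claimed reduction in configuration overhead comes from.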
Hardware virtualization also needs to be handled by the compiler, breaking large computing tasks down into smaller ones that fit on the available hardware resources and can be processed individually. Task partitioning and scheduling must be properly addressed in the compiler design to exploit both the spatial and temporal parallelism that SmartCell potentially offers. Other issues, such as loop breaking and redundancy optimization, also need to be addressed in the compiler algorithms. These challenges suggest several interesting directions for our future work.
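At its simplest, hardware virtualization is a tiling problem: a task larger than the array is cut into chunks that each fit on the available PEs and are executed in sequence on the same hardware. A minimal sketch (the function and its granularity model are our simplification, not the paper's algorithm):

```python
def partition(task_size, pe_count):
    """Split a task of `task_size` kernel operations into
    half-open chunks [start, end) that each fit on `pe_count`
    PEs, to be processed one after another (temporal reuse of
    the same hardware)."""
    return [(i, min(i + pe_count, task_size))
            for i in range(0, task_size, pe_count)]

# 150 kernel operations on the 64-PE prototype:
print(partition(150, 64))  # -> [(0, 64), (64, 128), (128, 150)]
```

A real compiler would additionally weigh inter-chunk data dependencies and reconfiguration cost when choosing cut points, which is exactly the scheduling problem flagged above as future work.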
6. Conclusion
This paper has presented SmartCell, a novel reconfigurable architecture for stream-based applications. It is a coarse-grained architecture that tiles a large number of processor elements with reconfigurable communication fabrics. A prototype with 64 PEs is implemented in TSMC 0.13 μm technology. The chip consists of about 1.6 million gates with an average power consumption of 1.6 mW/MHz over the evaluated benchmarks. The benchmarking results show that SmartCell is able to bridge the energy efficiency gap between fine-grained FPGAs and customized ASICs. Compared with Montium and RaPiD, SmartCell shows 4x and 2x throughput gains and is about 8% and 69% more energy efficient, respectively. These performance results indicate that SmartCell is a promising reconfigurable and energy efficient platform for stream processing.
Acknowledgments
This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) Young Faculty Award under Grant W911NF-07-1-0191-P00001, and by the National Science Foundation (NSF) through award ECS-0725522.
References
- Khailany B, Dally WJ, Kapasi UJ, et al.: Imagine: media processing with streams. IEEE Micro 2001, 21(2): 35-46. doi:10.1109/40.918001
- Fisher C, Rennie K, Xing G, et al.: Emulator for exploring RaPiD configurable computing architectures. Proceedings of the 11th International Conference on Field-Programmable Logic and Applications, 2001, 17-26.
- Mirsky E, DeHon A: MATRIX: a reconfigurable computing architecture with configurable instruction distribution and deployable resources. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 1996, 157-166.
- Schmit H, Whelihan D, Tsai A, Moe M, Levine B, Taylor RR: PipeRench: a virtualized programmable datapath in 0.18 micron technology. Proceedings of the Custom Integrated Circuits Conference, 2002, 63-66.
- Singh H, Lee M-H, Lu G, Kurdahi FJ, Bagherzadeh N, Filho EC: MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Transactions on Computers 2000, 49(5): 465-481. doi:10.1109/12.859540
- Taylor MB, Kim J, Miller J, et al.: The RAW microprocessor: a computational fabric for software circuits and general-purpose programs. IEEE Micro 2002, 22(2): 25-35. doi:10.1109/MM.2002.997877
- Marshall T, Stansfield L, Vuillemin J, Hutchings B: A reconfigurable arithmetic array for multimedia applications. Proceedings of the ACM/SIGDA 7th International Symposium on Field Programmable Gate Arrays, 1999, 135-143.
- Xilinx: http://www.xilinx.com/products/virtex4/index.htm
- Xilinx: http://www.xilinx.com/products/virtex5/index.htm
- Altera: http://www.altera.com/products/devices/stratix2/st2-index.jsp
- Kuon I, Rose J: Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2007, 26(2): 203-215.
- Hartenstein R: A decade of reconfigurable computing: a visionary retrospective. Proceedings of the IEEE Conference and Exhibition on Design, Automation and Test in Europe, 2001, 642-649.
- Becker J, Vorbach M: Architecture, memory and interface technology integration of an industrial/academic configurable system-on-chip (CSoC). Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2003, 107-112.
- DeHon A, Markovsky Y, Caspi E, et al.: Stream computations organized for reconfigurable execution. Microprocessors and Microsystems 2006, 30(6): 334-354. doi:10.1016/j.micpro.2006.02.009
- Kim Y, Kiemb M, Park C, Jung J, Choi K: Resource sharing and pipelining in coarse-grained reconfigurable architecture for domain-specific optimization. Proceedings of Design, Automation and Test in Europe (DATE '05), 2005, 1: 12-17.
- Zawodny J, Kogge P: Cache-in-memory. Innovative Architecture for Future Generation High-Performance Processors and Systems, January 2001, Maui, Hawaii, USA, 3-11.
- Draper J, Sondeen J, Mediratta S, Kim I: Implementation of a 32-bit RISC processor for the data-intensive architecture processing-in-memory chip. Proceedings of the IEEE Low Power Electronics and Design, 2005, 161-166.
- Lanuzza M, Margala M, Corsonello P: Cost-effective low-power processor-in-memory-based reconfigurable datapath for multimedia applications. Proceedings of the International Symposium on Low Power Electronics and Design, 2005, 161-166.
- Khawam S, Arslan T, Westall F: Synthesizable reconfigurable array targeting distributed arithmetic for system-on-chip applications. Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS '04), 2004, 2051-2058.
- Bouwens F, Berekovic M, Kanstein A, Gaydadjiev G: Architecture exploration of the ADRES coarse-grained reconfigurable array. Reconfigurable Computing: Architectures, Tools and Applications, Springer, 2007, 1-13.
- Smit LT, Rauwerda GK, Molderink A, Wolkotte PT, Smit GJM: Implementation of a 2-D IDCT on the reconfigurable Montium core. Proceedings of the International Conference on Field Programmable Logic and Applications (FPL '07), 2007, 562-566.
- Balfour J, Dally WJ: Design tradeoffs for tiled CMP on-chip networks. Proceedings of the 20th International Conference on Supercomputing, 2006, 187-198.
- Cannon L: A cellular computer to implement the Kalman filter algorithm. Ph.D. thesis, Montana State University, Bozeman, Mont, USA, 1969.
- Strassen V: Gaussian elimination is not optimal. Numerische Mathematik 1969, 13: 354-356. doi:10.1007/BF02165411
- Kang SM, Leblebici Y: CMOS Digital Integrated Circuits: Analysis and Design. 3rd edition, McGraw-Hill, New York, NY, USA, 2002.
- Heysters PM, Smit GJM, Molenkamp E: Energy-efficiency of the Montium reconfigurable tile processor. Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA '04), 2004, 38-44.
- Cronquist D, Fisher C, Figueroa M, Franklin P, Ebeling C: Architecture design of reconfigurable pipelined datapaths. Proceedings of the 20th Anniversary Conference on Advanced Research in VLSI, 1999, 23-40.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.