Implementation of a reconfigurable ASIP for high throughput low power DFT/DCT/FIR engine
© Hassan et al; licensee Springer. 2012
Received: 19 June 2011
Accepted: 5 April 2012
Published: 5 April 2012
In this article we present an ASIP design for a discrete Fourier transform (DFT)/discrete cosine transform (DCT)/finite impulse response (FIR) filter engine. The engine is intended for use in an accelerator-chain implementation of wireless communication systems. It offers a very high degree of flexibility, accepting and accelerating any-point DFT and inverse discrete Fourier transform (IDFT), one- and two-dimensional DCT, and even general implementations of FIR equations. Performance approaches that of dedicated implementations of such algorithms. A customized yet flexible redundant memory map allows processor-like access while keeping the pipeline full in a dedicated architecture-like manner. The engine is supported by a proprietary software tool that automatically sets the rounding pattern for the accelerator rounder to maintain a required signal-to-quantization-noise ratio or output RMS for any given algorithm. Programming of the processor is done through a mid-level language that combines register-specific instructions with DFT/DCT/FIR-specific instructions. Overall, the engine allows users to program a very wide range of applications with software-like ease while delivering performance very close to hardware. This puts the engine in an excellent spot in the current wireless communications environment with its profusion of multi-mode and emerging standards.
The rapid increase in the performance demand of wireless communication systems combined with the proliferation of standards both finalized and unfinalized has increased the need for a paradigm shift in the design of communication system blocks. Recent trends favor Software Defined Radio (SDR) systems due to their scalability and the ability to support multiple standards on the same platform. However, keeping performance within acceptable levels while doing this is a challenging research question.
Different approaches have been taken to address this question. Authors of [1–3] used Digital Signal Processors (DSPs) owing to their high configurability and adaptive capabilities. Although DSP performance is improving, it is still impractical due to its high power consumption and low throughput. On the other hand [4, 5] used configurable HW systems due to the high performance afforded by such platforms. However, these designs fail to catch up with the rapid growth in communication standards; they only support a limited class of algorithms for which they are specifically designed. Application specific instruction processors (ASIPs) offer an interesting position between the two approaches, allowing programming-like flexibility for a certain class of applications under speed and power constraints.
Different approaches to ASIPs offer different levels of flexibility. For example, [6–8] proposed ASIP designs that can be reconfigured to support all or some functions of the physical-layer Orthogonal Frequency Division Multiplexing (OFDM) receiver chain, including OFDM modulation/demodulation, channel estimation, turbo decoding, etc. This reconfigurability between dissimilar functions has a severe effect on performance, lowering throughput, raising power, or both. Realizing that these blocks operate simultaneously in a pipeline in an OFDM receiver, a different approach to partitioning the problem can be taken.
This article addresses the OFDM modulation/demodulation block, which is based on the Discrete Fourier Transform (DFT), and extends it to support similar transformations such as the Discrete Cosine Transform (DCT) and finite impulse response (FIR) filters. DFTs, DCTs, and FIR filters are used in innumerable communication and signal processing applications. For example, the DFT is commonly used in high data rate OFDM systems such as Long Term Evolution (LTE), WiMax, WiLAN, DVB-T, etc.; one of the main reasons is to increase robustness against frequency-selective fading and narrow-band interference. One- and two-dimensional DCTs are often used in audio and image processing systems such as interactive multimedia, digital TV-NTSC, low bit rate video conferencing, etc., owing to their compaction of energy into the lower frequencies. Finally, FIR filters are commonly used in digital signal processing applications whose signals span a wide frequency range, to isolate, reject, or attenuate frequency components depending on the system implementation.
1.1 Paper overview
We build on previous studies in [9, 10] where we presented a memory based architecture controlled by an instruction set processor. In this study we combine all elements of the design: performing further optimization on the processing elements (PE) to increase their flexibility and performance; as well as presenting a complete implementation including the full memory map and the programming front-end.
The supported mathematical algorithms are discussed in Section 2, followed by the system architecture and embedded processor in Section 3, the hardware (HW) accelerators in Section 4, and engine programming with coding examples in Section 5. Section 6 details ASIC results and a comparison with previously published designs. Section 7 concludes the article.
2 Supported algorithms
The engine supports multiple algorithms, some of which are listed below.
The direct implementation of Equation (1) is $O(N^2)$, which makes it difficult to meet typical throughput requirements. Common DFT symbol lengths in communication and signal processing standards have the form $2^x$, except for the LTE downlink, which also supports length $1536 = 2^9 \times 3$. Thus optimizing the throughput of a $2^x \times 3^y$-point DFT is our main concern.
For lower computation cycle counts, a higher-radix algorithm should be used. In practice, the radix-2 algorithm requires four times the number of cycles of the radix-4 algorithm, and the radix-4 algorithm requires four times the number of cycles of the radix-8 algorithm. On the other hand, higher-radix implementations have larger butterflies; they therefore consume more power and need more complex address generators to handle the data flow.
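The radix-2 versus radix-4 relation can be checked with a short calculation. The sketch below is only an illustrative model: it assumes one butterfly is processed per clock cycle and that N is an exact power of the radix, and the function name `butterfly_cycles` is hypothetical rather than part of the engine.

```python
def butterfly_cycles(n: int, radix: int) -> int:
    """Clock cycles for an n-point FFT at one radix-`radix` butterfly per cycle.

    Illustrative model only: n must be an exact power of `radix`.
    """
    stages = 0
    m = n
    while m > 1:
        if m % radix:
            raise ValueError("n must be a power of radix")
        m //= radix
        stages += 1
    # Each stage processes n / radix butterflies.
    return (n // radix) * stages


n = 1024
r2 = butterfly_cycles(n, 2)  # 512 butterflies per stage, 10 stages
r4 = butterfly_cycles(n, 4)  # 256 butterflies per stage, 5 stages
print(r2, r4, r2 // r4)      # prints: 5120 1280 4
```

Under these assumptions the radix-4 schedule halves both the number of stages and the number of butterflies per stage, which is where the factor of four comes from.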
Energy consumed for N-point FFT vs.
Rows: number of butterflies; number of stages; number of butterflies; number of clock cycles; total number of clock cycles for N-point FFT.
Entries: 0.5 × N × x, 0.375 × N × x, 0.43 × N × x, with $N = 2^x$.
Fix the address generator complexity by assuming that data are read from memory four samples at a time.
Normalize butterfly power by the number of complex multipliers in it, since the multiplier is the dominant power consumer in the butterfly.
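The multiplier-based normalization can be illustrated numerically. The sketch below assumes that a radix-r butterfly contains r − 1 complex multipliers and that energy scales as multipliers per butterfly × butterflies per stage × number of stages; both are modelling assumptions, and `normalized_energy_coeff` is a hypothetical helper. It covers only radix-2 and radix-4, for which the resulting coefficients agree with the 0.5 × N × x and 0.375 × N × x entries listed above.

```python
from fractions import Fraction


def normalized_energy_coeff(radix: int) -> Fraction:
    """Coefficient c such that normalized energy = c * N * x, with N = 2**x.

    Assumptions (not from the engine itself): `radix` is a power of two and
    a radix-r butterfly holds r - 1 complex multipliers.
    """
    if radix < 2 or radix & (radix - 1):
        raise ValueError("radix must be a power of two")
    bits = radix.bit_length() - 1   # log2(radix): x / bits stages in total
    multipliers = radix - 1         # assumed complex multipliers per butterfly
    # multipliers * (N / radix butterflies per stage) * (x / bits stages)
    return Fraction(multipliers, radix * bits)


print(float(normalized_energy_coeff(2)))  # prints: 0.5
print(float(normalized_energy_coeff(4)))  # prints: 0.375
```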
2.2 Inverse DFT
Then the output of DFT(v n ) is multiplied by .
2.4 Inverse DCT
Then construct v(n) as the IDFT of V(k), and finally rearrange v(n) to obtain x(n).
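The DFT-based route for the DCT and its inverse can be sketched in a few lines. The sketch assumes the common N-point reordering (even-indexed samples in ascending order followed by odd-indexed samples in descending order) and an unnormalized DCT-II with a factor of 2; the scaling convention, the naive `dft` helper, and all function names are illustrative assumptions, not the engine's fixed-point implementation.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT or inverse DFT, sufficient for a small demonstration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def reorder(x):
    """Build v(n): even-indexed samples ascending, then odd-indexed descending."""
    return list(x[0::2]) + list(x[1::2])[::-1]


def dct_via_dft(x):
    """Assumed convention: X(k) = 2 * sum_n x(n) * cos(pi * k * (2n + 1) / (2N))."""
    n = len(x)
    big_v = dft(reorder(x))
    return [2 * (cmath.exp(-1j * math.pi * k / (2 * n)) * big_v[k]).real
            for k in range(n)]


def idct_via_idft(big_x):
    """Rebuild V(k) from X(k), take the IDFT to get v(n), then undo the reordering."""
    n = len(big_x)
    ext = list(big_x) + [0.0]  # X(N) is taken as zero
    big_v = [0.5 * cmath.exp(1j * math.pi * k / (2 * n)) * (ext[k] - 1j * ext[n - k])
             for k in range(n)]
    v = [c.real for c in dft(big_v, inverse=True)]
    half = (n + 1) // 2
    x = [0.0] * n
    x[0::2] = v[:half]         # even-indexed outputs
    x[1::2] = v[half:][::-1]   # odd-indexed outputs were stored in reverse
    return x


samples = [1.0, -2.0, 3.5, 0.25, 4.0, -1.0, 0.5, 2.0]
restored = idct_via_idft(dct_via_dft(samples))
print(max(abs(a - b) for a, b in zip(samples, restored)))  # round-trip error
```

Because both directions reduce to one N-point DFT plus a reordering and a coefficient multiplication, the same radix hardware can serve the DFT and the DCT.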
where a's are the filter coefficients.
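As a minimal sketch of the FIR equation in multiply-accumulate form, assuming the standard direct-form definition y(n) = Σ a(i) · x(n − i) with samples before the start of the input taken as zero (the function name `fir` is illustrative):

```python
def fir(x, a):
    """Direct-form FIR filter: y(n) = sum_i a[i] * x(n - i), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, coeff in enumerate(a):
            if n - i >= 0:
                acc += coeff * x[n - i]  # one multiply-accumulate per tap
        y.append(acc)
    return y


print(fir([1, 2, 3, 4], [0.5, 0.5]))  # prints: [0.5, 1.5, 2.5, 3.5]
```

Each output sample costs one multiply-accumulate per tap, which is why MAC operations are the natural building block for this class of filters.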
2.6 Other transformations
Other transformations like any-point DFT can also be handled using basic operations like MAC, accumulator and vector operations.
3 ASIP processor
The first step in the design of a flexible and efficient ASIP is to identify the common set of operations in the class of operations which must be supported. The computationally intensive operations are defined as coefficient-generation, address-generation, and PE. These operations are supported by HW acceleration.
To meet the high throughput demand, data operations are handled through vector instructions. Synchronization in the processing pipeline is handled through handshakes between the system blocks. This greatly reduces the load on decoders, allowing continuous flow in the pipeline and providing dedicated design-like throughput.
The critical path in the PE is relatively short. This simplicity, combined with the high throughput of the pipeline, allows the user to greatly underclock the circuit, thus allowing significant power scaling with the application.
The register file is divided into even and odd sets: the real parts of complex words are saved in the even registers and the imaginary parts in the following odd registers. A complex word is addressed by the number of its real register, while a real word may be saved in any register and addressed by its index.
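The even/odd layout can be pictured with a small software model. This is a sketch only: the class name, the method names, and the register count of 16 are hypothetical and are not taken from the engine.

```python
class RegisterFile:
    """Toy model: even registers hold real parts, odd registers imaginary parts."""

    def __init__(self, size=16):  # size is a hypothetical register count
        if size % 2:
            raise ValueError("size must be even")
        self.regs = [0.0] * size

    def write_word(self, index, value):
        """A real word may live in any register and is addressed by its index."""
        self.regs[index] = value

    def read_word(self, index):
        return self.regs[index]

    def write_complex(self, real_index, value):
        """A complex word is addressed by its (even) real register number."""
        self._check(real_index)
        self.regs[real_index] = value.real
        self.regs[real_index + 1] = value.imag

    def read_complex(self, real_index):
        self._check(real_index)
        return complex(self.regs[real_index], self.regs[real_index + 1])

    @staticmethod
    def _check(real_index):
        if real_index % 2:
            raise ValueError("complex words are addressed by their even register")


rf = RegisterFile()
rf.write_complex(4, 1.5 - 2.0j)
print(rf.read_word(4), rf.read_word(5))  # prints: 1.5 -2.0
```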
The instruction set is divided into the following classes:
Radix instructions, such as Radix-2/3/4 and Inverse Radix-2/3/4, used for the DFT.
MAC instructions for FIR: multiply two data vectors and accumulate, or multiply a data vector by a coefficient and accumulate.
Vector multiplication instructions for DCT/IDCT: multiply by coefficient.
Vector instructions, such as accumulate, power, energy, addition, subtraction, multiplication, and multiply by coefficient, used to perform general vector arithmetic.
Word instructions, such as shift, set, load, store, and complex or word addition, subtraction, and multiplication, used mostly for control.
Data transmission instructions, such as data arrangement and data importing and exporting.
Control instructions, such as compare, conditional/unconditional branches, and disabling/enabling the handling of the imaginary part.
All vector instructions are applicable to complex words and can define the order in which data is read or written. MAC instructions are used for general implementations of FIR equations; they allow multiplication of data by data, or of data by stored or generated coefficients. MAC and vector multiplication instructions allow multiplication by coefficients or their inverses for general transformations.
4 Hardware accelerators
4.1 Processing element
4.2 Coefficient generator
The coefficient generator generates needed coefficients in two modes.
For we invert the imaginary part's sign. For radix-4 computations we need to generate three twiddle factors at a time, so we use two memories, the first memory is a dual-port RAM and is used to generate and . The second memory is a single-port RAM which is used to generate . For frame lengths with x > 2, the 2nd memory addresses are always even. So we remove all odd entries. This reduction adds a negligible noise in x = 2 case. For frame lengths with x = 1 we replace N by N'(N' = 4 × 3 y ). Consequently we save the first N'/ 4 = 3 y coefficients in RAM. To save power the 2nd memory is enabled only in radix-4 stage.
This method reduces coefficient memory size to 18% of a direct LUT implementation.
Mode two: Read stored coefficients from the first RAM, starting from a selected address and proceeding in ascending or descending order depending on the selected mode. This is more suitable for FIR filtering and direct implementations of general filters.
4.3 Read and write address generators
Each generator is connected to a single port RAM to get off-line generated addresses. Read and write address memories hold two addresses in each entry. To enable reading four sequential addresses in one clock cycle, write address memory is divided into two single port RAMs, one for odd entries and another for even entries.
The address generation modes are defined as:
Mode one: Generate addresses for the different radix-r stages. Radix-4, radix-2, and radix-3 stages need to read 4, 4 (two radix-2 butterflies handled in parallel), and 3 data samples respectively for their computations.
This can be handled in several ways: read data from memory two samples at a time, with a latency of 2 clock cycles for each radix operation; double the memory clock frequency and read two samples at a time, with a latency of 1 clock cycle per radix operation, at the expense of doubling the memory power; or use 4-port memories. Each of these techniques has drawbacks to different degrees, such as lower throughput, higher power, or both. In previous work we proposed an addressing scheme that solves this problem with conflict-free memory access. The scheme is contingent on partitioning the memory into 4 dual-port memory banks, as well as on the specific way data is distributed between the banks. This guarantees that at any stage we have at most two accesses to the same memory bank.
Initially, data is saved and distributed between the memory banks so as to be ready for the first radix stage (radix-4 or radix-3). With $N = 2^x \times 3^y$ ($x \neq 1$): if x is even (an integer number of radix-4 stages), the butterfly performs radix-4 computations to the end and then switches to radix-3 stages. Otherwise (x odd), the butterfly performs radix-4 stages followed by one radix-2 stage, and then switches to radix-3 stages. Switching to radix-3 stages costs a one-time additional stage to rearrange data in the memory banks. At the last radix-r stage, the radix output is saved in the same locations as the radix inputs.
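The stage ordering described above can be written as a small planning function. This is a sketch under the stated rule only: the name `stage_plan` is hypothetical, the one-time data rearrangement stage before radix-3 is not modelled, and the x = 1 case is rejected because it is handled separately.

```python
def stage_plan(n: int) -> list:
    """Return the radix of each stage, in order, for N = 2**x * 3**y with x != 1."""
    if n < 1:
        raise ValueError("n must be positive")
    x = y = 0
    m = n
    while m % 2 == 0:
        m //= 2
        x += 1
    while m % 3 == 0:
        m //= 3
        y += 1
    if m != 1:
        raise ValueError("n must have the form 2**x * 3**y")
    if x == 1:
        raise ValueError("x = 1 is not covered by this sketch")
    plan = [4] * (x // 2)   # radix-4 stages first
    if x % 2:
        plan.append(2)      # one radix-2 stage when x is odd
    plan += [3] * y         # radix-3 stages last
    return plan


print(stage_plan(1024))  # prints: [4, 4, 4, 4, 4]
print(stage_plan(1536))  # prints: [4, 4, 4, 4, 2, 3]
```

For the LTE length 1536 = 2^9 × 3, the plan is four radix-4 stages, one radix-2 stage, and one radix-3 stage.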
Mode two: Generate addresses for DCT/IDCT modes to arrange data in v(n) and V(k) order.
Each input sample is saved in two locations in memory, and the data is imported two samples at a time in order to reduce data transmission time. Data is therefore distributed across the memory banks so as to prevent memory conflicts.
Mode three: Generate addresses for other vector instructions such as MAC. It generates two addresses for two data vectors, or for one data vector (two samples at a time), either as an in-order sequence or by fetching them from memory, with the start addresses and data length as input parameters.
5 Engine programming
The embedded processor programming passes through three phases: simulation, testing, and verification. We will discuss a 1024-point DFT, an 8 × 8 DCT, and a 64-tap FIR filter as case studies.
The goal of these simulations is to find the best values for our design parameters, which are the scale factor by which we divide at each radix-r stage, the word length, and the coefficient length.
5.1.1 Scale factor
Select the highest SQNR.
Guarantee the output RMS, to keep signal peaks, which are needed in some applications.
The tool is written in Matlab. It generates all possible scale factors with the corresponding signal-to-quantization-noise ratio (SQNR) and output RMS, then selects the best scale vector depending on the input mode. Using the first mode for our example yields scale factors of (4 4 2 2 2) for the five stages of the 1024-point DFT, giving the highest SQNR with a Gaussian input.
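The two figures of merit reported by the tool can be computed as follows. This sketch assumes the usual definitions (SQNR as the ratio of reference signal power to quantization error power, in dB, and RMS as the root of the mean squared magnitude); the helper names and the simple rounding quantizer are illustrative and are not taken from the Matlab tool.

```python
import math


def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB between two equal-length sequences."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


def rms(values):
    """Root mean square of a sequence of real or complex values."""
    return math.sqrt(sum(abs(v) ** 2 for v in values) / len(values))


def quantize(values, frac_bits):
    """Round each value to the nearest multiple of 2**-frac_bits (toy quantizer)."""
    step = 2.0 ** -frac_bits
    return [round(v / step) * step for v in values]


reference = [0.3, -0.7, 0.55, 0.1]
for bits in (4, 8, 12):
    # More fractional bits give a smaller rounding error and a higher SQNR.
    print(bits, round(sqnr_db(reference, quantize(reference, bits)), 1))
```

In the first selection mode, the candidate with the largest `sqnr_db` value would be kept; the second mode would additionally constrain the output `rms`.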
5.1.2 Word and coefficient lengths
Code for the application is written using custom mnemonics that combine HW-specific instructions with application-specific instructions. This then passes through an assembly compiler (written in Matlab), which generates the boot-loading and program object files.
When processing begins, the decoder accesses address zero in the boot loading ROM and reads initialization instructions. These instructions are mainly used for loading data and instructions from flash memory to the corresponding RAM memory in the system. Upon finishing, the decoder jumps to program memory and starts processing.
5.2.1 1024-point DFT
8 × 8 DCT
5.2.2 FIR filter
Verification of these and other examples is through bit-matching the results of random input patterns with fixed-point results from fixed point golden files. The golden files are verified and tested against a floating point model to make sure they perform the needed tasks. The golden files are used to verify the RTL design by generating test cases, both directed and random.
6 Implementation results and performance evaluation
Synthesis results (with memories)
Up to 8K point-DFT 1D symbol 26 complex word length
IBM 130 nm CMOS technology (6 layers)
Gate libraries: Typical (55 °C)
Fast library (125 °C) used for worst-case conditions
Memory library: (125 °C)
Number of Cells
0.612 × 0.6 (0.36) mm²
56 mW at 100 MHz
6.2 Performance evaluation
Number of clock cycles and SQNR for 1D-DFT including data transfer times between the embedded engine and the host
Number of clock cycles for 1D-DCT including data transfer times between the embedded engine and the host
Columns: cycles per DCT; latency @ 100 MHz (μs).
Number of clock cycles for 2D-DCT including data transfer times between the embedded engine and the host
Columns: N × N-point DCT; cycles per DCT; latency @ 100 MHz (μs). Rows: 8 × 8, 16 × 16, 32 × 32, 64 × 64.
Number of clock cycles and SQNR for 1D-DFT including data transfer times between the embedded engine and the host
Column: time to end.
Weidong and Wanhammar proposed a pipelined ASIC FFT processor. This confirms our discussion in Section 3: the pipeline architecture has higher throughput but loses on power efficiency.
The authors of [5, 18, 19] proposed memory-based Application-Specific Integrated Circuits (ASICs) for scalable DFT engines. One enables run-time configuration of the DFT length, with supported lengths ranging only from 16 to 4096 points; another is a reconfigurable FFT processor whose FFT lengths range only from 128 to 8192 points; and another can perform 64 2048-point FFT. These engines have high throughput rates, but they support only the kinds of algorithms for which they were designed.
In contrast, other authors used digital signal processors owing to their high reconfigurability and adaptive capabilities. Although DSP performance is improving, it is still unsuitable due to its high power consumption and low throughput. Hsu and Lin proposed an approach for DFT implementation on a DSP with low memory reference and high flexibility; however, it is optimized for $2^x$-point DFTs, and it needs 40,338 cycles to complete one 1024-point DFT.
The third solution [20, 21] is the ASIP, which is a compromise between the above solutions. Zhong et al. proposed a DFT/IDFT processor based on multi-processor rings. This engine has four processor rings (8- and 16-point FFT) and supports DFT lengths from 16 to 4096 points. Guan et al. proposed a scalable ASIP architecture for any-point DFT at the expense of a large PE (containing an 8-point butterfly). The authors present only the power consumption of the functional unit and data address generator, so we did not include it in the table.
Shah et al. present a pipelined, scalable any-point 1D/2D DFT engine that requires 256 clock cycles for a (16 × 16) 2D DFT, while another published design and this design require 512 cycles. Nevertheless, the proposal of Shah et al. has a larger area.
For the 1D DCT, we use a previously published mathematical algorithm that implements the 1D DCT with building blocks common to the DFT. The engine has a throughput of one 512-point DCT per 1,771 cycles, and one 1024-point DFT per 3,435 cycles.
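These cycle counts translate directly into latency at the 100 MHz clock used for the synthesis results. The helper below is a trivial sketch (the name `latency_us` is hypothetical); it only divides the cycle count by the clock frequency in MHz.

```python
def latency_us(cycles: int, clock_mhz: float = 100.0) -> float:
    """Latency in microseconds for a given cycle count and clock frequency in MHz."""
    return cycles / clock_mhz


print(latency_us(3435))  # 1024-point DFT, prints: 34.35
print(latency_us(1771))  # 512-point DCT, prints: 17.71
```

Underclocking the engine scales these latencies proportionally, which is the trade-off behind power scaling with the application.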
Among existing 2D DCT designs, one engine has been tailored to a particular application and needs 80 cycles for an (8 × 8) 2D DCT, and a programmable DSP supports scalable (N × N) 2D DCT for N = 4 to 64. needs 2,538 cycles for (16 × 16)-DCT 2D,
More power efficient than most other architectures proposed in the literature.
Can support many OFDM systems with relatively low power.
High reconfigurability, which allows users to program a very wide range of applications with software-like ease.
Support for peripheral operations beside the main processes, such as the CP remover, which was needed in the proposed WiMAX demo.
Simple interfaces (FIFO interface) which handle data transfer between the engine and asynchronous blocks with different clock domains.
The engine parameters, such as the number of bits and the memory sizes and types, are parameterized to meet different requirements and higher symbol lengths.
A new address generation scheme allows reading and writing the butterfly data in one clock cycle, which allows one butterfly operation per clock. This reduces processing time by 50% without doubling the clock frequency and with no loss in power.
The selection of the radix-4 algorithm, which has the best power efficiency.
HW accelerators accelerate the processing and reduce the complexity of the decoder.
Pipelined processing of the vector instructions also accelerates the processing.
Simultaneous input and output data transfers over four data buses, which reduce data transfer time by 75%.
Reduced time to market, by supporting a compiler tool for the engine with a simple instruction set.
The use of classified engines allows a high degree of optimization.
Applications that can be supported and the corresponding estimated clock frequency
LTE, WI-MAX, WLAN, DVB-T, DVB-T, DVB-H, DAB, ADSLs and VDSL
Low bit rate video conferencing, basic video telephony, interactive multimedia and digital TV-NTSC
This study was part of a project supported by a grant from STDF, Egypt (Science and Technology Development Fund).
1. Liu X, Wang Y: Memory Access Reduction Method for efficient implementation of Vector-Radix 2D fast cosine transform pruning on DSP. Proceedings of the IEEE SoutheastCon 2010, 68-72.
2. Hsu YP, Lin SY: Implementation of Low-Memory Reference FFT on Digital Signal Processor. Journal of Computer Science 2008, 7: 545-549.
3. Frigo M, Johnson SG: The Design and Implementation of FFTW3. Proceedings of the IEEE 2005, 93: 216-231.
4. Jo BG, Sunwoo MH: New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy. IEEE Transactions on Circuits and Systems 2005, 52(5):911-919.
5. Jacobson AT, Truong DN, Baas BM: The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor. IEEE International Symposium on Circuits and Systems ISCAS 2009, 1133-1136.
6. Hangpei T, Deyuan G, Yian Z: Gaining Flexibility and Performance of Computing Using Application-Specific Instructions and Reconfigurable Architecture. International Journal of Hybrid Information Technology 2009, 2: 324-329.
7. Poon ASY: An Energy-Efficient Reconfigurable Baseband Processor for Wireless Communications. (IEEE) Trans VLSI 2007, 15(3):319-327.
8. Iacono DL, Zory J, Messina E, Piazzese N, Saia G, Bettinelli A: ASIP Architecture for Multi-Standard Wireless Terminals. Design, Automation and Test in Europe (DATE '06) 2006, 2: 1-6.
9. Hassan HM, Shalash AF, Hamed HM: Design architecture of generic DFT/DCT 1D and 2D engine controlled by SW instructions. Asia Pacific Conference on Circuits and Systems APCCAS 2010, 84-87.
10. Hassan HM, Shalash AF, Mohamed K: FPGA Implementation of an ASIP for high throughput DFT/DCT 1D/2D engine. IEEE International Symposium on Circuits and Systems (ISCAS) 2011, 1255-1258.
11. Cooley JW, Tukey JW: An Algorithm for Machine Computation of Complex Fourier Series. Mathematics of Computation 1965, 19: 297-301.
12. Nguyen T, Koilpillai RD: The theory and Design of Arbitrary-length cosine-modulated filter Banks and wavelets, satisfying perfect reconstruction. IEEE Transaction on signal processing 1996, 44(3):473-483.
13. Braganza S, Leeser M: The 1D Discrete Cosine Transform for Large Point Sizes Implemented on Reconfigurable Hardware. IEEE International Conference on Application-specific Systems, Architectures and Processors ASAP 2007, 101-106.
14. Weidong Li, Wanhammar L: A Pipeline FFT Processor. IEEE Workshop on Signal Processing Systems, 1999. SiPS 99 1999, 19: 654-662.
15. Chidambaram R, Leuken RV, Quax M, Held I, Huisken J: A multistandard FFT processor for wireless system-on-chip implementations. Proc International Symposium on Circuits and Systems 2006, 47.
16. He S, Torkelson M: Design and Implementation of a 1024-point Pipeline FFT Processor. Proceedings of the IEEE 1998 Custom Integrated Circuits Conference 1998, 131-134.
17. Lin JM, Yu HY, Wu YJ, Ma HP: A Power Efficient Baseband Engine for Multiuser Mobile MIMO-OFDMA Communications. IEEE Transactions on Circuits and Systems I 2010, 57: 1779-1792.
18. Sung TY, Hsin HC, Ko LT: Reconfigurable VLSI Architecture for FFT Processor. WSEAS Transactions on Circuits and Systems 2009, 8.
19. Lee YH, Yu TH, Huang KK, Wu AY: Rapid IP Design of Variable-length Cached-FFT Processor for OFDM-based Communication Systems. IEEE Workshop on Signal Processing Systems Design and Implementation, 2006. SIPS '06 2006, 62-65.
20. Zhong G, Xu F, Willson AN Jr: A power-scalable reconfigurable FFT/IFFT IC based on a multi-processor ring. IEEE Journal of Solid-State Circuits (JSSC) 2006, 41: 483-495.
21. Guan X, Lin H, Fei Y: Design of an Application-specific Instruction Set Processor for High-throughput and Scalable FFT. IEEE International Symposium on Circuits and Systems ISCAS 2009, 2513-2516.
22. Shah S, Venkatesan P, Sundar D, Kannan M: Low Latency, High Throughput, and Less Complex VLSI Architecture for 2D-DFT. International Conference on Signal Processing, Communications and Networking ICSCN 2008, 349-353.
23. Shah S, Venkatesan P, Sundar D, Kannan M: A Fingerprint Recognition Algorithm Using Phase-Based Image Matching for Low-Quality Fingerprints. IEEE International Conference on the Image Processing 2005, 33-36.
24. Tumeo A, Monchiero M, Palermo G, Ferrandi F, Sciuto D: A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs. IEEE Computer Society Annual Symposium on VLSI 2007, 331-336.
25. Cho YJ, Yu CL, Yu TH, Zhan CZ, Wu AYA: Efficient Fast Fourier Transform Processor Design for DVB-H System. Proc VLSI/CAD Symposium 2007.
26. Sung TY: Memory-efficient and high-speed split-radix FFT/IFFT processor based on pipeline CORDIC rotations. IEE Proceedings, Vision, Image and Signal Processing 2006, 153: 405-410.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.