 Research
 Open Access
Implementation of a reconfigurable ASIP for high throughput low power DFT/DCT/FIR engine
 Hanan M Hassan^{1},
 Karim Mohammed^{1} and
 Ahmed F Shalash^{1}
https://doi.org/10.1186/1687396320123
© Hassan et al; licensee Springer. 2012
 Received: 19 June 2011
 Accepted: 5 April 2012
 Published: 5 April 2012
Abstract
In this article we present an ASIP design for a discrete Fourier transform (DFT)/discrete cosine transform (DCT)/finite impulse response (FIR) filter engine. The engine is intended for use in an accelerator-chain implementation of wireless communication systems. The engine offers a very high degree of flexibility, accepting and accelerating any-number DFT and inverse discrete Fourier transform (IDFT), one- and two-dimensional DCT, and even general implementations of FIR equations. Performance approaches that of dedicated implementations of such algorithms. A customized yet flexible redundant memory map allows processor-like access while keeping the pipeline full in a dedicated-architecture-like manner. The engine is supported by a proprietary software tool that automatically sets the rounding pattern for the accelerator rounder to maintain a required signal-to-quantization-noise ratio or output RMS for any given algorithm. The processor is programmed through a mid-level language that combines register-specific instructions with DFT/DCT/FIR-specific instructions. Overall, the engine allows users to program a very wide range of applications with software-like ease while delivering performance very close to hardware. This puts the engine in an excellent spot in the current wireless communications environment with its profusion of multi-mode and emerging standards.
Keywords
 DFT
 DCT
 FIR
 ASIP
 reconfigurable hardware
1 Introduction
The rapid increase in the performance demand of wireless communication systems combined with the proliferation of standards both finalized and unfinalized has increased the need for a paradigm shift in the design of communication system blocks. Recent trends favor Software Defined Radio (SDR) systems due to their scalability and the ability to support multiple standards on the same platform. However, keeping performance within acceptable levels while doing this is a challenging research question.
Different approaches have been taken to address this question. The authors of [1–3] used Digital Signal Processors (DSPs) owing to their high configurability and adaptive capabilities. Although DSP performance is improving, it is still impractical due to its high power consumption and low throughput. On the other hand, [4, 5] used configurable HW systems due to the high performance afforded by such platforms. However, these designs fail to keep up with the rapid growth in communication standards; they only support the limited class of algorithms for which they are specifically designed. Application-specific instruction-set processors (ASIPs) offer an interesting position between the two approaches, allowing programming-like flexibility for a certain class of applications under speed and power constraints.
Different approaches to ASIPs offer different levels of flexibility. For example, [6–8] proposed ASIP designs with the reconfigurability to support all or some functions of the physical-layer Orthogonal Frequency Division Multiplexing (OFDM) receiver chain, including OFDM modulation/demodulation, channel estimation, turbo decoding, etc. This reconfigurability between non-similar functions has a severe effect on performance, lowering throughput, raising power, or both. Realizing that these blocks operate simultaneously in a pipeline in an OFDM receiver, a different approach to partitioning the problem can be taken.
This article addresses the OFDM modulation/demodulation block, which is based on the Discrete Fourier Transform (DFT), and extends it to support similar transformations such as the Discrete Cosine Transform (DCT) and finite impulse response (FIR) filters. DFTs, DCTs, and FIRs are used in innumerable communication and signal processing applications. For example, the DFT is commonly used in high data rate OFDM systems such as Long Term Evolution (LTE), WiMax, WiLAN, DVB-T, etc.; one of the main reasons is to increase robustness against frequency-selective fading and narrowband interference. One- and two-dimensional DCTs are often used in audio and image processing systems such as interactive multimedia, digital TV-NTSC, low bit rate video conferencing, etc., owing to the DCT's compaction of energy into the lower frequencies. Finally, the FIR filter is commonly used in digital signal processing applications whose spectrum covers a wide range of frequencies, to filter frequency components by isolation, rejection or attenuation depending on the system implementation.
1.1 Paper overview
We build on previous studies in [9, 10], where we presented a memory-based architecture controlled by an instruction set processor. In this study we combine all elements of the design: we perform further optimization on the processing elements (PE) to increase their flexibility and performance, and we present a complete implementation including the full memory map and the programming front-end.
The supported mathematical algorithms are discussed in Section 2. This is followed by the system architecture and embedded processor in Section 3, the hardware (HW) accelerators in Section 4, and engine programming with a coding example in Section 5. Section 6 details the ASIC results and a comparison with previously published designs. Section 7 concludes the article.
2 Supported algorithms
The engine can support multiple algorithms; some of these algorithms are listed below.
2.1 DFT
The DFT of an N-point sequence x(n) is defined as
$X\left(k\right)=\sum_{n=0}^{N-1}x\left(n\right){W}_{N}^{nk}$(1)
where: $\left\{\begin{array}{c}k=0,\dots ,N-1\\ {W}_{N}={e}^{-2\pi i/N}\end{array}\right.$
The direct implementation of Equation (1) is O(N^{2}), which makes it difficult to meet typical throughput requirements. The DFT symbol length in most communication and signal processing standards has the form 2^{x}; the exception is the LTE downlink, which supports length 1536 = 2^{9} × 3. Thus optimizing the throughput of a 2^{x} × 3^{y}-point DFT is our main concern.
For lower computation cycle counts, a higher-radix algorithm should be used. In practice, the radix-2 algorithm requires four times the number of cycles of the radix-4 algorithm, and the radix-4 algorithm requires four times the number of cycles of the radix-8 algorithm. On the other hand, higher-radix implementations have bigger butterflies; thus they consume more power and need more complex address generators to handle the data flow.
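To make the O(N^{2}) cost of the direct form concrete, the following minimal Python sketch evaluates the DFT sum term by term and checks whether a length has the 2^{x} × 3^{y} form targeted by the engine. It assumes the usual forward-transform convention W_N = e^{-2πi/N}; the function names `direct_dft` and `factor_2_3` are illustrative and not part of the engine.

```python
import cmath


def direct_dft(x):
    """Direct evaluation of the DFT sum: N outputs, N products each (O(N^2))."""
    n_points = len(x)
    # Assumed convention: W_N = exp(-2*pi*i / N)
    w_n = cmath.exp(-2j * cmath.pi / n_points)
    return [
        sum(x[n] * w_n ** (n * k) for n in range(n_points))
        for k in range(n_points)
    ]


def factor_2_3(n_points):
    """Return (x, y) with n_points == 2**x * 3**y, or raise ValueError."""
    if n_points < 1:
        raise ValueError("length must be positive")
    x = y = 0
    while n_points % 2 == 0:
        n_points //= 2
        x += 1
    while n_points % 3 == 0:
        n_points //= 3
        y += 1
    if n_points != 1:
        raise ValueError("length is not of the form 2**x * 3**y")
    return x, y


if __name__ == "__main__":
    print(direct_dft([1, 1, 1, 1]))  # energy concentrates in bin 0
    print(factor_2_3(1536))          # LTE downlink length
```

The nested sum is exactly what the radix-r decompositions avoid; the factorization helper only classifies a length and does not perform any transform.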
Energy consumed for N-point FFT vs.
| Algorithm | Radix-2 | Radix-4 | Radix-8 |
| --- | --- | --- | --- |
| Number of butterflies | 2 | 1 | 1 |
| Number of stages | log_{2}(N) | log_{4}(N) | log_{8}(N) |
| Number of butterfly operations/stage | $\frac{N}{2}$ | $\frac{N}{4}$ | $\frac{N}{8}$ |
| Number of clock cycles/butterfly | 1 | 1 | 2 |
| Total number of clock cycles for N-point FFT | $\frac{N}{4}{\log}_{2}\left(N\right)=\left(\frac{N}{4}\right)\left(x\right)$ | $\frac{N}{4}{\log}_{4}\left(N\right)=\left(\frac{N}{4}\right)\left(\frac{x}{2}\right)$ | $\frac{N}{4}{\log}_{8}\left(N\right)\times 2=\left(\frac{N}{4}\right)\left(\frac{x}{3}\right)$ |
| Normalized power | 2 | 3 | 7 |
| Power efficiency | 0.5 × N × x | 0.375 × N × x | 0.43 × N × x |

As N = 2^{x}.

The address generator complexity is fixed by assuming the data are read from memory 4 samples by 4 samples.

Butterfly power is normalized by the number of complex multipliers in the butterfly, which are its dominant power consumers.
$\text{Power efficiency}=\frac{\text{Power}}{\text{Throughput}}\approx \frac{\text{No. of multipliers}}{1/\text{No. of cycles to end}}$(2)
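As a worked check of Equation (2), the short Python sketch below recomputes the radix-2 and radix-4 columns of the table for a given N = 2^{x}. It assumes, as the table notes do, four samples read per clock and power normalized to the multiplier count; the names `total_cycles` and `power_efficiency` are illustrative only, and the radix-8 column is left out.

```python
# Normalized power per configuration, taken from the table
# (number of complex multipliers active per clock).
NORMALIZED_POWER = {2: 2, 4: 3}

# log2 of the radix: how many bits of x one stage consumes.
BITS_PER_STAGE = {2: 1, 4: 2}


def total_cycles(n_points, radix):
    """Cycles for an N-point FFT: (N/4) * log_radix(N), with N = 2**x."""
    x = n_points.bit_length() - 1
    if n_points < 4 or 2 ** x != n_points:
        raise ValueError("n_points must be a power of two >= 4")
    stages = x / BITS_PER_STAGE[radix]
    return (n_points / 4) * stages


def power_efficiency(n_points, radix):
    """Equation (2): power / throughput ~ multipliers * cycles to end."""
    return NORMALIZED_POWER[radix] * total_cycles(n_points, radix)


if __name__ == "__main__":
    n = 1024  # x = 10
    for radix in (2, 4):
        print(radix, total_cycles(n, radix), power_efficiency(n, radix))
```

For N = 1024 this gives 0.5 × N × x for radix-2 and 0.375 × N × x for radix-4, matching the last row of the table; a lower value means less energy per transform.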
2.2 Inverse DFT
2.3 DCT
Then the output of DFT(v_{ n }) is multiplied by $2\omega \left(k\right){e}^{\frac{i2\pi k}{2N}}$.
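The reuse of the DFT for the DCT can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the engine's datapath: it assumes the common even/odd reordering v(n) = x(2n), v(N−1−n) = x(2n+1) for even N, uses the post-multiplication factor 2e^{−iπk/(2N)}, omits the ω(k) normalization, and computes the unnormalized DCT-II. All function names are hypothetical.

```python
import cmath
import math


def dft(v):
    """Plain O(N^2) DFT, standing in for the engine's radix stages."""
    n = len(v)
    return [
        sum(v[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def reorder(x):
    """Even samples ascending, then odd samples descending (assumed v_n order)."""
    n = len(x)
    if n % 2:
        raise ValueError("this sketch assumes an even length")
    v = [0.0] * n
    for i in range(n // 2):
        v[i] = x[2 * i]
        v[n - 1 - i] = x[2 * i + 1]
    return v


def dct_via_dft(x):
    """Unnormalized DCT-II from one N-point DFT plus a complex post-multiply."""
    n = len(x)
    big_v = dft(reorder(x))
    return [
        2.0 * (cmath.exp(-1j * math.pi * k / (2 * n)) * big_v[k]).real
        for k in range(n)
    ]


def dct_direct(x):
    """Reference: C(k) = 2 * sum_n x(n) cos(pi (2n+1) k / (2N))."""
    n = len(x)
    return [
        2.0 * sum(x[m] * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                  for m in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(dct_via_dft(data))
    print(dct_direct(data))
```

The point of the construction is that the only DCT-specific work is the data rearrangement and one vector multiplication by coefficients, both of which map onto address-generation and vector-multiply hardware shared with the DFT.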
2.4 Inverse DCT
Then construct v(n) by getting the IDFT of V (k), finally rearrange v(n) to get x(n).
2-Dimension modes
2.5 FIR
where a's are the filter coefficients.
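A multiply-accumulate view of the FIR computation is sketched below in Python. It assumes the standard direct form y(n) = Σ_i a_i x(n − i) with zero initial state; the function name `fir_filter` is illustrative only.

```python
def fir_filter(samples, coefficients):
    """Direct-form FIR: one multiply-accumulate per tap per output sample."""
    outputs = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coefficients):
            if n - i >= 0:          # samples before time zero are taken as 0
                acc += a * samples[n - i]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # An impulse input reproduces the coefficients (the impulse response).
    print(fir_filter([1, 0, 0, 0], [3, 2, 1]))
    # A two-tap moving sum.
    print(fir_filter([1, 2, 3, 4], [1, 1]))
```

The inner loop is the operation a MAC instruction performs over a data vector and a coefficient vector.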
2.6 Other transformations
Other transformations, such as the any-point DFT, can also be handled using basic operations like MAC, accumulate and vector operations.
3 ASIP processor
The first step in the design of a flexible and efficient ASIP is to identify the common set of operations in the class of applications which must be supported. The computationally intensive operations are defined as coefficient generation, address generation, and the processing element (PE). These operations are supported by HW acceleration.
To meet the high throughput demand, data operations are handled through vector instructions. Synchronization in the processing pipeline is handled through handshakes between the system blocks. This greatly reduces the load on the decoders, allowing continuous flow in the pipeline and providing dedicated-design-like throughput.
The critical path in the PE is relatively short. This simplicity, combined with the high throughput of the pipeline, allows the user to greatly underclock the circuit, thus allowing significant power scaling with the application.
(1) The register file is divided into even and odd sets: the real parts of complex words are saved in the even registers and the imaginary parts in the following odd registers. Complex words are addressed by their real register number, while a real word may be saved in any register and addressed by its index.
(2) The instruction set is divided into seven classes:

Radix instructions, such as Radix-2/3/4 and Inverse Radix-2/3/4, used for the DFT.

MAC instructions for FIR: multiply two data vectors and accumulate, or multiply a data vector by coefficients and accumulate.

Vector multiplication instructions for DCT/IDCT: multiply by coefficient.

Vector instructions, such as accumulate, power, energy, addition, subtraction, multiplication, and multiply by coefficient, used to perform general vector arithmetic.

Word instructions, such as shift, set, load, store, and complex or word addition, subtraction, and multiplication, used mostly for control.

Data transmission instructions, such as data arrangement, data importing and data exporting.

Control instructions, such as compare, conditional/unconditional branches, and disabling/enabling the handling of the imaginary part.
All vector instructions are applicable to complex words and can define the order in which data is read or written. MAC instructions are used for general implementations of FIR equations. MAC allows multiplication of data by data, or of data by stored or generated coefficients. MAC and vector multiplication instructions allow multiplication by coefficients or their inverse for general transformation purposes.
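The even/odd register convention described in item (1) can be modeled with a few lines of Python. This is only a behavioral sketch: the class name `RegisterFile`, the method names, and the register count of 16 are hypothetical and are not taken from the design.

```python
class RegisterFile:
    """Behavioral model: real part in an even register, imaginary part in the next odd one."""

    def __init__(self, size=16):  # size is an arbitrary illustrative choice
        self.regs = [0.0] * size

    def write_word(self, index, value):
        """A real word may live in any register."""
        self.regs[index] = value

    def read_word(self, index):
        return self.regs[index]

    def write_complex(self, index, value):
        """A complex word is addressed by its (even) real-part register."""
        if index % 2:
            raise ValueError("complex words must start at an even register")
        self.regs[index] = value.real
        self.regs[index + 1] = value.imag

    def read_complex(self, index):
        if index % 2:
            raise ValueError("complex words must start at an even register")
        return complex(self.regs[index], self.regs[index + 1])


if __name__ == "__main__":
    rf = RegisterFile()
    rf.write_complex(4, 3 - 2j)
    print(rf.read_word(4), rf.read_word(5), rf.read_complex(4))
```

One consequence of this layout is that word instructions can touch the real or imaginary half of a complex value individually, which suits control-style bookkeeping.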
4 Hardware accelerators
4.1 Processing element
4.2 Coefficient generator
The coefficient generator generates needed coefficients in two modes.
For ${e}^{j2\pi \frac{n}{N}}$ we invert the sign of the imaginary part. Radix-4 computations need three twiddle factors at a time, so we use two memories. The first memory is a dual-port RAM used to generate ${e}^{j2\pi \frac{n}{N}}$ and ${e}^{j2\pi \frac{3n}{N}}$. The second memory is a single-port RAM used to generate ${e}^{j2\pi \frac{2n}{N}}$. For frame lengths with x > 2, the addresses of the second memory are always even, so we remove all odd entries. This reduction adds negligible noise in the x = 2 case. For frame lengths with x = 1 we replace N by N' (N' = 4 × 3^{y}). Consequently we save the first N'/4 = 3^{y} coefficients in RAM. To save power, the second memory is enabled only in radix-4 stages.
This method reduces the coefficient memory size to 18% of a direct LUT implementation.
Mode two: Read stored coefficients from the first RAM, starting from a selected address and proceeding in ascending or descending order depending on the selected mode. This is more suitable for FIR filtering and direct implementations of general filters.
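The idea behind shrinking the twiddle-factor table in mode one can be illustrated with a generic symmetry trick: store only the first quarter of the unit circle and rebuild the rest by swapping and negating real and imaginary parts. The Python sketch below is an assumption-laden illustration (it stores N/4 entries, so it does not reproduce the 18% figure or the two-memory organization), and the names `build_quarter_table`, `twiddle` and `inverse_twiddle` are hypothetical.

```python
import cmath


def build_quarter_table(n_points):
    """Store W_N^n = exp(-2*pi*i*n/N) only for n in [0, N/4)."""
    if n_points % 4:
        raise ValueError("this sketch assumes N divisible by 4")
    return [cmath.exp(-2j * cmath.pi * n / n_points)
            for n in range(n_points // 4)]


def twiddle(table, n_points, n):
    """Rebuild W_N^n for any n from the quarter table."""
    quadrant, offset = divmod(n % n_points, n_points // 4)
    w = table[offset]
    for _ in range(quadrant):
        # Multiplying by W_N^(N/4) = -j is a swap plus one sign change.
        w = complex(w.imag, -w.real)
    return w


def inverse_twiddle(table, n_points, n):
    """Inverse-transform coefficient: invert the sign of the imaginary part."""
    return twiddle(table, n_points, n).conjugate()


if __name__ == "__main__":
    table = build_quarter_table(16)
    print(len(table), twiddle(table, 16, 5), inverse_twiddle(table, 16, 5))
```

Because each quadrant step is a swap and a negation, no extra multiplier is needed to expand the stored entries to the full set of coefficients.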
4.3 Read and write address generators
Each generator is connected to a single port RAM to get offline generated addresses. Read and write address memories hold two addresses in each entry. To enable reading four sequential addresses in one clock cycle, write address memory is divided into two single port RAMs, one for odd entries and another for even entries.
The address generation modes are defined as:
Mode one: Generate addresses for the different radix-r stages. Radix-4, radix-2 and radix-3 need to read 4, 4 (two radix-2 operations handled in parallel) and 3 data samples, respectively, for their computations.
This can be handled in several ways: read data from memory 2 samples by 2 samples with a 2-clock latency for each radix operation; double the memory clock frequency and read 2 samples by 2 samples with a 1-clock latency for each radix operation, at the expense of double the memory power; or use 4-port memories. Each of these techniques has drawbacks to different degrees, such as lower throughput, higher power, or both. In [10] we proposed an addressing scheme that solves this problem with conflict-free memory access. The scheme is contingent on partitioning the memory into 4 dual-port memory banks, as well as on the specific way data is distributed between the banks. This guarantees that at any stage there are at most two accesses to the same memory bank.
Initially, data is saved and distributed between the memory banks so as to be ready for the first radix stage (radix-4 or radix-3). As N = 2^{x} × 3^{y} (x ≠ 1), if x is even (an integer number of radix-4 stages) the butterfly performs radix-4 computations until they are complete, then switches to radix-3 stages. Otherwise (x odd), the butterfly performs $\left(\frac{x-1}{2}\right)$ radix-4 stages followed by one radix-2 stage, then switches to radix-3 stages. Switching to radix-3 stages costs a one-time additional stage to rearrange the data in the memory banks. At the last radix-r stage, the radix output is saved in the same locations as the radix inputs.
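The stage ordering described above can be written down as a small planning function. The following Python sketch only derives the sequence of radices for a given length; it does not model the bank rearrangement stage or the address patterns, and the name `plan_stages` is illustrative.

```python
def plan_stages(n_points):
    """Radix sequence for N = 2**x * 3**y (x != 1): radix-4, optional radix-2, radix-3."""
    x = y = 0
    remaining = n_points
    while remaining > 1 and remaining % 2 == 0:
        remaining //= 2
        x += 1
    while remaining > 1 and remaining % 3 == 0:
        remaining //= 3
        y += 1
    if n_points < 1 or remaining != 1:
        raise ValueError("length is not of the form 2**x * 3**y")
    if x == 1:
        raise ValueError("x = 1 is outside the case described here")

    stages = [4] * (x // 2)      # (x-1)/2 radix-4 stages when x is odd
    if x % 2:
        stages.append(2)         # single radix-2 stage for odd x
    stages += [3] * y            # radix-3 stages come last
    return stages


if __name__ == "__main__":
    print(plan_stages(1024))  # x = 10: radix-4 only
    print(plan_stages(2048))  # x = 11: radix-4 stages, then one radix-2
    print(plan_stages(1536))  # LTE downlink: 2**9 * 3
```

For 1024 points this yields five stages, which agrees with the five scale factors reported for that length.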
Mode two: Generate addresses for DCT/IDCT modes to arrange data in v_{ n } and V_{ k } order.
Each input sample is saved in two locations in memory, and the data is imported 2 samples by 2 samples in order to reduce data transmission time. The data is therefore distributed across the memory banks so as to prevent memory conflicts.
Mode three: Generate addresses for other vector instructions such as MAC. This mode generates two addresses, either for two data vectors or for one data vector (two samples at a time), in an in-order sequence, or fetches them from memory. The start addresses and data length are input parameters.
5 Engine programming
The embedded processor programming passes through three phases: simulation, testing, and verification. We will discuss a 1024-point DFT, an 8 × 8 DCT and a 64-tap FIR as case studies.
5.1 Simulation
The goal of these simulations is to find the best values of the design parameters: the scale factor by which we divide at each radix-r stage, the word length, and the coefficient length.
5.1.1 Scale factor
(1) Select the highest SQNR.
(2) Guarantee the output RMS, to keep signal peaks, which are needed in some applications.
The tool is written in Matlab. It generates all possible scale factors with the corresponding signal-to-quantization-noise ratio (SQNR) and output RMS, then selects the best scale vector depending on the input mode. Using the first mode for our example yields scale factors of (4 4 2 2 2) for the five stages of the 1024-point DFT, giving the highest SQNR with a Gaussian input.
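The exhaustive search performed by the tool can be outlined in Python as follows. This is a sketch of the first mode only (highest SQNR) under several assumptions: the per-stage choices {1, 2, 4} are a guess, the `evaluate` callback stands in for a fixed-point simulation that is not shown, and all names are hypothetical rather than the tool's actual interface.

```python
import itertools
import math


def sqnr_db(reference, quantized):
    """SQNR = 10*log10(signal power / quantization-noise power)."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


def candidate_scale_vectors(stages, choices=(1, 2, 4)):
    """All per-stage scale-factor combinations (choices are an assumption)."""
    return itertools.product(choices, repeat=stages)


def select_highest_sqnr(stages, evaluate, choices=(1, 2, 4)):
    """Mode (1): return the scale vector whose simulated SQNR is largest.

    `evaluate(vector)` is a user-supplied function returning the SQNR in dB
    of a fixed-point run with that vector; it is not provided here.
    """
    return max(candidate_scale_vectors(stages, choices), key=evaluate)


if __name__ == "__main__":
    # Toy stand-in for a fixed-point simulation, only to exercise the search.
    toy = lambda v: -sum(abs(a - b) for a, b in zip(v, (4, 4, 2, 2, 2)))
    print(select_highest_sqnr(5, toy))
    print(sqnr_db([1.0, 0.0], [0.9, 0.0]))
```

With three choices per stage, a five-stage transform has 243 candidate vectors, so brute-force enumeration stays cheap for the lengths considered here.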
5.1.2 Word and coefficient lengths
5.2 Testing
Code for the application is written using custom mnemonics that combine HW-specific instructions with application-specific instructions. This then passes through an assembly compiler (written in Matlab) which generates the boot-loading and program object files.
When processing begins, the decoder accesses address zero in the boot loading ROM and reads initialization instructions. These instructions are mainly used for loading data and instructions from flash memory to the corresponding RAM memory in the system. Upon finishing, the decoder jumps to program memory and starts processing.
5.2.1 1024-point DFT
8 × 8 DCT
5.2.2 FIR filter
5.3 Verification
Verification of these and other examples is done by bit-matching the results of random input patterns against fixed-point results from fixed-point golden files. The golden files are verified and tested against a floating-point model to make sure they perform the needed tasks. The golden files are used to verify the RTL design by generating test cases, both directed and random.
6 Implementation results and performance evaluation
6.1 Implementation
Synthesis results (with memories)
| Up to 8K-point DFT 1D symbol, 26 complex word length | |
| --- | --- |
| Technology | IBM 130 nm CMOS technology (6 layers) |
| Voltage | 1.08 V |
| Libraries | Gate libraries: Typical (55°C); Fast library (125°C) used for worst-case conditions; Memory library: (125°C) |
| Number of cells | 57,906 |
| Area | 0.612 × 0.6 (0.36) mm^{2} |
| Power | 56 mW at 100 MHz |
| Max frequency | 700 MHz |
6.2 Performance evaluation
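Number of clock cycles and SQNR for 1D-DFT including data transfer times between the embedded engine and the host
| N-point DFT | Cycles per DFT | Latency @ 100 MHz (μs) | SQNR (dB) | s1 | s2 | s3 | s4 | s5 | s6 | s7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 64 | 146 | 1.46 | 83.67 | 4 | 2 | 2 | | | | |
| 128 | 278 | 2.78 | 86.16 | 4 | 2 | 2 | 2 | | | |
| 256 | 470 | 4.7 | 96.839 | 4 | 4 | 2 | 2 | | | |
| 512 | 1002 | 10.02 | 96.37 | 4 | 4 | 2 | 2 | 2 | | |
| 1024 | 1898 | 18.98 | 99.1 | 4 | 4 | 2 | 2 | 2 | | |
| 2048 | 4222 | 42.22 | 98.97 | 4 | 4 | 2 | 2 | 2 | 2 | |
| 4096 | 8318 | 83.18 | 97.84 | 4 | 4 | 4 | 2 | 2 | 2 | |
| 8192 | 18578 | 185.78 | 95.25 | 4 | 4 | 4 | 2 | 2 | 2 | 2 |

Columns s1 to s7 give the scale factor applied at each stage.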
Number of clock cycles for 1D-DCT including data transfer times between the embedded engine and the host
| N-point DCT | Cycles per DCT | Latency @ 100 MHz (μs) |
| --- | --- | --- |
| 64 | 177 | 1.77 |
| 256 | 592 | 5.92 |
| 512 | 1247 | 12.47 |
| 1024 | 2399 | 23.99 |
Number of clock cycles for 2D-DCT including data transfer times between the embedded engine and the host
| N × N-point DCT | Cycles per DCT | Latency @ 100 MHz (μs) |
| --- | --- | --- |
| 8 × 8 | 186 | 1.86 |
| 16 × 16 | 390 | 3.9 |
| 32 × 32 | 1724 | 17.24 |
| 64 × 64 | 6380 | 63.8 |
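The latency columns in the cycle-count tables follow directly from the cycle counts and the 100 MHz clock. The small Python sketch below shows the conversion, along with the corresponding transform rate; the function names are illustrative.

```python
def latency_us(cycles, clock_mhz=100.0):
    """Latency in microseconds: cycles divided by the clock in MHz."""
    return cycles / clock_mhz


def transforms_per_second(cycles, clock_mhz=100.0):
    """Back-to-back transform rate at the given clock."""
    return clock_mhz * 1e6 / cycles


if __name__ == "__main__":
    for name, cycles in (("1024-point DFT", 1898),
                         ("1024-point DCT", 2399),
                         ("8 x 8 DCT", 186)):
        print(name, latency_us(cycles), round(transforms_per_second(cycles)))
```

The same relation can be inverted to estimate the clock frequency needed to meet a given symbol period, which is how a lower clock translates into power scaling for less demanding standards.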
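Number of clock cycles and SQNR for 1D-DFT including data transfer times between the embedded engine and the host
| Reference | Implementation | Technology (nm) | Voltage (V) | Frequency (MHz) | Max-point DFT | Time to end (μs) | Power (mW) | Normalized power | Power efficiency | SQNR (dB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [14] | Pipeline HW | 350 | 1.5 | 25 | 1K | 40.96 | 200 | 35.6 | 1.4 | N/A |
| [25] | Configurable HW | 180 | 1.8 | 86 | 8K | 805 | 75.51 | 19.6 | 15.8 | N/A |
| [18] | Configurable HW | 180 | 1.8 | 200 | 8K | 395 | 117 | 84.2 | 33 | N/A |
| [5] | Configurable HW | 65 | 1.3 | 866 | 4K | 7.1 | 35 | 48.3 | 0.3 | 71.90 |
| [26] | Configurable HW | 180 | 1.8 | 150 | 8K | 138 | 350 | 91 | 12.5 | N/A |
| [19] | Configurable HW | 180 | 1.8 | 70 | 2K | 224 | 140 | 36.4 | 8.15 | N/A |
| [2] | DSP | | | 100 | 1K | 403.3 | N/A | N/A | N/A | N/A |
| [20] | ASIP | 250 | 2.5 | 100 | 4K | 52.80 | 275 | 26.6 | 1.4 | 61.23 |
| [21] | ASIP | 180 | 1.8 | 300 | 1K | 13.8 | N/A | N/A | N/A | N/A |
| Proposed | ASIP | 130 | 1.08 | 100 | 1K | 18.98 | 19 | 19 | 0.3 | 99.1 |
| Proposed | ASIP | 130 | 1.08 | 100 | 4K | 42.2 | 25 | 25 | 1.05 | 97.84 |
| Proposed | ASIP | 130 | 1.08 | 100 | 8K | 185.7 | 56 | 56 | 10.3 | 95.25 |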
6.3 Discussion
Weidong and Wanhammar [14] proposed a pipelined ASIC FFT processor. This supports our discussion in Section 3: the pipeline architecture has a higher throughput but loses on power efficiency.
The authors of [5, 18, 19] proposed memory-based Application-Specific Integrated Circuits (ASICs) for scalable DFT engines. The engine in [5] enables run-time configuration of the DFT length, but the supported lengths vary only from 16 to 4096 points. The engine in [18] is a reconfigurable FFT processor whose FFT lengths vary only from 128 to 8192 points, and [19] can perform 64- to 2048-point FFTs. These engines have high throughput rates, but they only support the kinds of algorithms for which they are designed.
In contrast, [2] used digital signal processors owing to their high reconfigurability and adaptive capabilities. Although DSP performance is improving, it is still unsuitable due to its high power consumption and low throughput. Hsu and Lin [2] proposed an approach for DFT implementation on a DSP with low memory reference and high flexibility; however, it is optimized for 2^{x}-point DFTs and needs 40,338 cycles to complete one 1024-point DFT.
The third solution, used in [20, 21], is the ASIP, which is a compromise between the above solutions. Zhong et al. [20] proposed a DFT/IDFT processor based on multiprocessor rings. This engine presents four processor rings (8-, 16-point FFT) and supports DFT lengths from 16 to 4096 points. Guan et al. [21] proposed a scalable ASIP architecture for the any-point DFT at the expense of a large PE (containing an 8-point butterfly). The authors present only the power consumption of the functional unit and the data address generator, so we did not include it in the table.
Shah et al. [22] present a pipelined scalable any-point 1D/2D DFT engine which requires 256 clock cycles for a (16 × 16) 2D DFT, while [23] and this design require 512 cycles. Nevertheless, the proposal of Shah et al. has a larger area.
For the 1D DCT, we use the mathematical algorithm in [12], which implements ASIC 1D-DCT building blocks common with the DFT. The engine has a throughput of one 512-point DCT per 1,771 cycles, and one 1024-point DFT per 3435 cycles.
Among existing 2D DCT designs, the engine in [24] has been tailored to a particular application and needs 80 cycles for an (8 × 8) 2D DCT, while the programmable DSP in [1] supports a scalable (N × N) 2D DCT for N = 4-64 and needs 2,538 cycles for a (16 × 16) 2D DCT.

More power efficient than most other architectures proposed in the literature.

Can support many OFDM systems at relatively low power.

High reconfigurability, which allows users to program a very wide range of applications with software-like ease.

Supports peripheral operations beside the main processes, such as the CP remover which was needed in the proposed WiMAX demo.

Simple interfaces (FIFO interface) which handle data transfer between the engine and asynchronous blocks in different clock domains.

The engine parameters, such as the number of bits and the memory sizes and types, are parameterized to meet different requirements and higher symbol lengths.

A new address generation scheme allows reading and writing the butterfly data in one clock cycle, which allows one butterfly operation per clock. This reduces processing time by 50% without doubling the clock frequency and with no loss in power.

The selection of the radix-4 algorithm, which has the best power efficiency.

Using HW accelerators accelerates the processing and reduces the complexity of the decoder.

Pipelined processing of the vector instructions also accelerates the processing.

Simultaneous input and output data transmission over four data buses, which reduces data transmission time by 75%.

Reduced time to market, by supporting a compiler tool for the engine with a simple instruction set.

The use of classified engines allows a high degree of optimization.
7 Conclusion
Applications that can be supported and the corresponding estimated clock frequency
| Application class | Estimated clock frequency | Applications |
| --- | --- | --- |
| DFT-1D applications | 90 MHz | LTE, WiMAX, WLAN, DVB-T, DVB-H, DAB, ADSLs and VDSL |
| DCT-2D applications | 60 MHz | Low bit rate video conferencing, basic video telephony, interactive multimedia and digital TV-NTSC |
Declarations
Acknowledgements
This study was part of a project supported by a grant from STDF, Egypt (Science and Technology Development Fund).
Authors’ Affiliations
References
1. Liu X, Wang Y: Memory Access Reduction Method for efficient implementation of Vector-Radix 2D fast cosine transform pruning on DSP. Proceedings of the IEEE SoutheastCon 2010, 68-72.
2. Hsu YP, Lin SY: Implementation of Low-Memory Reference FFT on Digital Signal Processor. Journal of Computer Science 2008, 7: 545-549.
3. Frigo M, Johnson SG: The Design and Implementation of FFTW3. Proceedings of the IEEE 2005, 93: 216-231.
4. Jo BG, Sunwoo MH: New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy. IEEE Transactions on Circuits and Systems 2005, 52(5):911-919.
5. Jacobson AT, Truong DN, Baas BM: The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor. IEEE International Symposium on Circuits and Systems ISCAS 2009, 1133-1136.
6. Hangpei T, Deyuan G, Yian Z: Gaining Flexibility and Performance of Computing Using Application-Specific Instructions and Reconfigurable Architecture. International Journal of Hybrid Information Technology 2009, 2: 324-329.
7. Poon ASY: An Energy-Efficient Reconfigurable Baseband Processor for Wireless Communications. IEEE Trans VLSI 2007, 15(3):319-327.
8. Iacono DL, Zory J, Messina E, Piazzese N, Saia G, Bettinelli A: ASIP Architecture for Multi-Standard Wireless Terminals. Design, Automation and Test in Europe (DATE '06) 2006, 2: 16.
9. Hassan HM, Shalash AF, Hamed HM: Design architecture of generic DFT/DCT 1D and 2D engine controlled by SW instructions. Asia Pacific Conference on Circuits and Systems APCCAS 2010 2010, 84-87.
10. Hassan HM, Shalash AF, Mohamed K: FPGA Implementation of an ASIP for high throughput DFT/DCT 1D/2D engine. IEEE International Symposium on Circuits and Systems (ISCAS) 2011 2011, 1255-1258.
11. Cooley JW, Tukey JW: An Algorithm for Machine Computation of Complex Fourier Series. Mathematics of Computation 1965, 19: 297-301.
12. Nguyen T, Koilpillai RD: The theory and Design of Arbitrary-length cosine-modulated filter Banks and wavelets, satisfying perfect reconstruction. IEEE Transactions on Signal Processing 1996, 44(3):473-483.
13. Braganza S, Leeser M: The 1D Discrete Cosine Transform for Large Point Sizes Implemented on Reconfigurable Hardware. IEEE International Conference on Application-specific Systems, Architectures and Processors ASAP 2007, 101-106.
14. Weidong Li, Wanhammar L: A Pipeline FFT Processor. IEEE Workshop on Signal Processing Systems, 1999. SiPS 99 1999, 19: 654-662.
15. Chidambaram R, Leuken RV, Quax M, Held I, Huisken J: A multi-standard FFT processor for wireless system-on-chip implementations. Proc International Symposium on Circuits and Systems 2006, 47.
16. He S, Torkelson M: Design and Implementation of a 1024-point Pipeline FFT Processor. Proceedings of the IEEE 1998 Custom Integrated Circuits Conference 1998, 131-134.
17. Lin JM, Yu HY, Wu YJ, Ma HP: A Power Efficient Baseband Engine for Multiuser Mobile MIMO-OFDMA Communications. IEEE Transactions on Circuits and Systems-I 2010, 57: 1779-1792.
18. Sung TY, Hsin HC, Ko LT: Reconfigurable VLSI Architecture for FFT Processor. WSEAS Transactions on Circuits and Systems 2009, 8.
19. Lee YH, Yu TH, Huang KK, Wu AY: Rapid IP Design of Variable-length Cached-FFT Processor for OFDM-based Communication Systems. IEEE Workshop on Signal Processing Systems Design and Implementation, 2006. SIPS '06 2006, 62-65.
20. Zhong G, Xu F, Willson AN Jr: A power-scalable reconfigurable FFT/IFFT IC based on a multi-processor ring. IEEE Journal of Solid-State Circuits (JSSC) 2006, 41: 483-495.
21. Guan X, Lin H, Fei Y: Design of an Application-specific Instruction Set Processor for High-throughput and Scalable FFT. IEEE International Symposium on Circuits and Systems ISCAS 2009, 2513-2516.
22. Shah S, Venkatesan P, Sundar D, Kannan M: Low Latency, High Throughput, and Less Complex VLSI Architecture for 2D-DFT. International Conference on Signal Processing, Communications and Networking ICSCN 2008, 349-353.
23. Shah S, Venkatesan P, Sundar D, Kannan M: A Fingerprint Recognition Algorithm Using Phase-Based Image Matching for Low-Quality Fingerprints. IEEE International Conference on Image Processing 2005, 33-36.
24. Tumeo A, Monchiero M, Palermo G, Ferrandi F, Sciuto D: A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs. IEEE Computer Society Annual Symposium on VLSI 2007, 331-336.
25. Cho YJ, Yu CL, Yu TH, Zhan CZ, Wu AYA: Efficient Fast Fourier Transform Processor Design for DVB-H System. Proc VLSI/CAD Symposium 2007.
26. Sung TY: Memory-efficient and high-speed split-radix FFT/IFFT processor based on pipeline CORDIC rotations. IEE Proceedings, Vision, Image and Signal Processing 2006, 153: 405-410.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.