Embedded reconfigurable synchronization & acquisition ASIP for a multistandard OFDM receiver
 Mahmoud A Said^{1}Email author,
 Omar A Nasr^{1} and
 Ahmed F Shalash^{1}
DOI: 10.1186/1687396320122
© Said et al; licensee Springer. 2012
Received: 15 July 2011
Accepted: 26 March 2012
Published: 26 March 2012
Abstract
Embedded reconfigurable architectures are currently attracting increasing attention in the wireless communications industry due to the escalating number of wireless standards in today's market. Application specific instructionset processors (ASIPs) present a reconfigurable solution that offers a compromise between programmability and low power consumption. In this article, the design and implementation of an embedded synchronization and acquisition ASIP for OFDM based systems is proposed. The engine architecture is presented and the programming model is explained in details. The proposed engine is scalable and it can be configured to support a multitude of synchronization algorithms and OFDM standards. While applicable to many OFDM systems, the proposed architecture was successfully verified on long term evolution (LTE Rel. 8) and WiMAX 802.16e systems. A partial list of synchronization and acquisition algorithms are tested on the engine for the two standards, and the results highlight the capabilities of the engine. The processor has been synthesized with 0.18μ m standard cell CMOS library. It is estimated to occupy 1.1 mm^{2} and the projected power consumption is 7.9mW at 120 MHz, which meets the speed requirements of the tested standards. More results are included within the article.
Keywords
reconfigurable ASIP embedded processors baseband processors lowpower design OFDM synchronization1 Introduction
Contemporary wireless standards allow for the radio to have connectivity with more than one technology at the same time. For example, the radio can be connected to a WiFi hotspot when a signal exists, or to a WiMAX base station if the WiFi signal is weak or does not exist. The ability to connect to more than one technology increases the reliability and the use of the radio's connectivity. It also enables applications that require constant connectivity, such as remote health care and remote industrial automation, which cannot tolerate any loss of connectivity at any time. Moreover, there are still competition, enhancements, regional variants, and new versions of the wireless standards that emerge with time. For example, in the field of 4G and beyond, the marginal competing standards and the need to have an easy migration path between different systems increases the need for configurability without sacrificing throughput, area or power consumption. This line of thinking gave a boost to the concept of the software defined radio (SDR) [1]. SDR, in general, is based on general purpose digital signal processors (DSPs). Thus, it suffers from limitations in throughput and power consumption. However, the need for programmability and configurability is unabated due to the proliferation of wireless standards. Another approach to achieve configurability without sacrificing power consumption is to use application specific instructionset processors (ASIP) [2–4]. In ASIP technology, a core unit is programmed using a specific instruction set that configures the core unit to perform multiple functionalities.
For most of today's and emerging standards, orthogonal frequency division multiplexing (OFDM) was the modulation scheme of choice in systems such as high performance LAN type 2 (HIPERLAN/2) [5], IEEE 802.11a [6], IEEE 802.16 family [7] and 3GPP long term evolution (LTE). OFDM's main advantage is its ability to alleviate the intersymbol interference (ISI) caused by multipath fading channels, even for large
channel delay spreads. Hence, at the receiver, there is no need to design complex channel equalizers, which reduces the complexity and the power consumption of the receiver. On the other hand, OFDM systems are very sensitive to synchronization errors [8]. Therefore, there is a need to design, and efficiently implement high accuracy synchronization algorithms using embedded reconfigurable engines that can support the increasing number of OFDMbased standards. The concept of reconfigurable engines for wireless applications has been previously explored in the literature. Configurable radio architectures that can support multiple standards were proposed in [9, 10], where the engine core consists of an array of reconfigurable units. Poon [11] uses five different configurable units to perform all tasks for the digital part of the radio. Application specific processor architecture was proposed in [4] for OFDM channel estimation. In our previous study [3], an ASIP architecture is proposed to support synchronization tasks in OFDM systems. In [3], we proposed only the architecture of reconfigurable engine architecture to achieve a compromise between powerful dedicated hardware implementations and very flexible general DSP processors, but with limited programming capabilities.
Expanding on [3], an embedded reconfigurable ASIPbased engine that can efficiently carry out OFDM synchronization and acquisition tasks is presented. The main building block of the engine is a core unit that was designed to efficiently carry out synchronization tasks. The core unit can be programmed with a special instruction set to optimize the usage of the hardware resources. Memories for data and instructions, registers for intermediate data storage, and an instruction decoder are all parts of the the engine. The engine and the instruction set are optimized for vector instructions, which are frequently used in synchronization and acquisition algorithms. The results show that the hardware multiplexing in this ASIP solution reaches a smaller implementation area than the solution of multiple dedicated implementations. In addition, it allows a higher degree of hardware reuse between different algorithms in different standards.
The organization of the article is as follows: Section 2 introduces the OFDM system model. The detailed engine architecture is proposed in Section 3. Section 4 discusses the algorithm selection and analysis of the processing tasks, while the programming model is discussed in Section 5. Results of the proposed engine are presented in Section 6. Section 7 concludes the article.
2 OFDM system model
where H_{ l,k } is the channel frequency response at subcarrier k in symbol l and n_{ l }_{,k}is the additive noise samples at subcarrier k in symbol l.
Tracking the variations of the CFO and SCFO is critical in OFDM systems due to their sensitivity to frequency offsets. SCFO and the residual part of the CFO (RCFO) are estimated and corrected in a tracking phase.
 1.
Acquisition phase. Four processes are performed in this phase: symbol timing (frame boundary detection), initial fractional CFO (FCFO) estimation, cellsearch (CS) and ICFO estimation. Correction of the estimated errors is shown in Figure 2
 2.
Tracking phase. In this phase RCFO and SCFO are estimated and corrected.
From the implementation point of view, a significant amount of baseband processing takes place in the synchronization subsystem. Optimized architectures that fulfill the needs of the synchronization subsystem with a high degree of configurability will have the advantage in terms of area and power.
3 Design of the proposed engine
3.1 Engine architecture
3.1.1 Configurable unit (CU)
Major CU supported operation
CUconfiguration  Resulting operation 

CMAC  Autocorrelation 
Crosscorrelation  
Euclidean distance calculation  
Vector complex multiplication  
Real multiplyadd  ab + cd 
Controlled CACC  BPSK preamble correlation 
CORDIC  Vectoring mode 
Rotation mode  
Maximum likelihood  Online comparison 
The main operation of the CU is the Complex Multiply ACcumulate (CMAC). Mathematically, it can be implemented in two ways, using either three or four real multipliers. To limit the number of multipliers, three multipliers are used, although five adders are required as opposed to four adders in the four multipliers scheme. An extra adder is added to allow the summation of eight different real inputs or four complex inputs. In addition to the two large accumulators, the CU uses internal multiplexers to configure the running operation according to the control vector (CV).
The CU consists of six 12bit real adders, three 13bit real multipliers followed by two 12bit rounders, two 24bit accumulators, two two's complement operations, ten 13bit multiplexers (MUX) and two 24bit shifters. The eight ports (I1 ... I8) in Figure 4 are intended for operations on real data while the six ports (I1 ... I6) can be used alone to implement the complex multiplication process. Real ADD/SUB operations are executed with two different precisions (12bit and 24bit).
The CU is optimized by pipelining into three pipeline stages. The first stage is an addition stage used for normal and vector complex multiplications. This adds the benefit of having a first stage capable of adding eight real numbers before passing its output to the next addition stage (stage 3). The second stage is the multiplication stage. It has only one multiplier between two registers to minimize the critical path of the overall unit. The third stage is the second addition stage like the first stage but it has only two adders instead of four.
One cycle of latency is achieved when pipelining the CU in normal instructions. Vector instructions are executed on a time multiplexing manner on the CU with a maximum vector length of 256 elements. Among different supported operations, the controlled accumulation (CACC) operation needs the largest number of simultaneous complex input signals. CACC operation adds or subtracts four complex words every cycle. This puts a constraint on the memory system to supply the unit with a maximum of four words every cycle. However in this mode, no write operations can be executed. To work in CACC mode, hardware configuration of the CU with respect to the control vector is done. M1, M2, and M3 multipliers of Figure 4 are bypassed while a running configuration of the Add/Sub operations (A1 ... A8) are controlled via a stored control sequence.
Time multiplexing of operations running on the engine core limits its multiprocess/cycle capabilities. A maximum of one operation/cycle can be executed on the engine core, no matter wither this operation is simple like addition or computationally complex like complexmultiplications. Although the architecture has one CU, the engine is scalable via adding multiple CU units connected with each other by the two ports I9 and I 10 to support larger systems.
3.1.2 Memory system
Memory is divided into two 286 word dualport banks (24bit). Memory size is dominated by the maximum supported correlation length of 256 in addition to the free space needed to store any internal outputs. The choice of the maximum correlation length was based on the required performance in 802.16e and 3GPPLTE release 8. Inputs to the memory system is connected to a Memory Input Generator in Figure 3, which is controlled by the instruction decoder. Memory could accept inputs from the external ports, Register File 2, main CU output or the memory itself in a MOV operation. Memory controller handles the write operations and prevent any racing conditions. The two banks are running on the same operating frequency of the core unit. No special addressing modes are required, and hence, address generators are basically counters. Time sharing between different tasks running on the processor allowed further optimization in the memory system by increasing the memory reuse option.
Two general purpose register files, Register File 1 and Register File 2, in Figure 3 are used with a register input generator controlled by the instruction decoder directly. Register File 1 is of size 12bit and holds 8 general purpose registers to facilitate data flow operations, counting, set outputs and many other useful operations. Register File 2 is of size 24bit but it consists of four general purpose registers only. The first advantage of them comes when dealing with movements of complex data inside the engine. These optimization methods beyond the traditional one fixed size register file allows faster execution of real and complex data operations. For example, moving a complex word from the memory system as two (real, imaginary) parts would take double latency beside the complexity in dealing with the two parts as a one word in the executed program.
Reference correlation sequences are stored in 3072 byte ROM for the both of 802.16e and LTE release 8. The ROM takes its address directly from the instruction decoder with an internal counter for its address only. All data can be transferred between different parts of the engine through a data bus of four complex words maximum.
3.1.3 Input/output interface
The engine interfaces with the outer world through a set of input control signals (IC) and output control signals (OC) beside two external ports for data transfers. Four IC signals are connected directly with the instruction decoder and used for acknowledgment about a certain event. Another four OC signals output from the controller of the output updater to identify the state of the engine at any stage. The two external data transfer ports are 24bit wide each (12bit real, 12bit imaginary). One port is connected to the time domain side, while the other port is connected after the FFT operation. Any read operation is carried out through one of these two ports and with the two addresses external read address 1 and external read address 2. The accuracy of the chosen number of bits is verified in Section 6.
The output updater holds the same output value on the same port, until a control signal comes from the instruction decoder to its controller to update the output with a newer value in a certain register. The output registers are general registers used to set any value as an output. Here, we give output registers restricted names to clarify the engine operation. The five output registers holds the starting address, FCFO, CELLID, ICFO, and SCFO. All of the five registers are 12bit each. Typically, Starting Address goes to the input buffer that holds the FFT window to identify the first sample in the incoming frame. The estimated FCFO is used by the CFO correction complexmultiplier to derotate the input samples. CELLID is transferred to higher layers. ICFO is added to the fractional part of the CFO to guarantee correct reception with time. Estimated value of SCFO is considered as the seed for the ROB/STUFF correction algorithm in [13].
While transferring data from any port to the internal memory banks, no execution of any other instruction is carried out. This control mechanism is achieved when the controller holds the instruction inside the instruction register by reentering the same instruction to the instruction register till the end of the transfer process. The same mechanism works for vector instructions, where the controller reenter the vector instruction to the instruction register till the end of the execution phase.
3.1.4 CORDIC algorithm
3.2 Engine programming
Programming of the proposed embedded ASIP includes three types of instructions. The first type is the ordinary classes like program flow instructions (conditional and unconditional jumps), move instructions, real, and complex ADD/SUB instructions, interfacing control instructions (external reads, output set). The second type is optimized instructions to facilitate the implementation of synchronization subsystem tasks as well as other algorithms in different parts of the engine. The third type is vector instructions.
Ordinary instructions operate on single data points stored in registers (RF1 & RF2), and the result is automatically stored in another register. Most of instructions of this type take one cycle to complete. All control instructions belong to this simple class of instructions. Optimized instructions are special instructions for special purposes like the ANGLE instruction and the BPCACC (discussed later). This kind of instructions operates on single point or vector of complex numbers stored either in the memory like BPCACC or in registers like ANGLE, and the result stored also in either memory or registers. Execution of these instructions always consist of multiple execution stages. Vector instructions operate on vectors of complex numbers stored in memories. The output is either stored in another memory if there is no accumulation associated with it, or in a register from register file 2 if there is an accumulation. The number of cycles needed for vector instructions depend on the vector length.
Execution of three operations of different types
Operation  Inputs source  DM1, DM2  M1, M2, M3  A1,..., A6  A7, A8  CORDIC control  CORDIC LUT  shifters  ROMS  output updater 

CMAC  Data path (DM)  active  active  active  active  off  inactive  inactive  inactive  inactive 
Angle  Register file 2  inactive  bypassed  bypassed  active  on  active  active  inactive  inactive 
Read  Data path (External)  active  inactive  inactive  inactive  off  inactive  inactive  inactive  inactive 
The CORDIC subroutine is executed when fetching the ANGLE instruction (special type) in 18 cycles for a precision of 12bits (as stated before). Shifters included with the engine core are not general purpose shifters that accept arbitrary inputs; they are used only in the execution of the iterative CORDIC algorithm and are controlled by a counter attached with the instruction decoder.
Although most of the instructions are either control instructions or instructions that operate on single data (ADD, SUB,...), the processor operates most of the time on vector data. Hence, the processor is optimized for operations on vectors of complex data or specialized operations associated with many supported tasks. The assembly program becomes relatively long for control or single data instructions compared to what it fulfills.
The engine is programmed via a script (contain the entire program) enters a primitive compiler. The compiler outputs a (.hex and.mif) memory initialization file for the program memory. The output file containing the binary vector is downloaded into the program memory to begin the program fetching.
4 Example algorithms analysis
 1.
Obtain coarse frame boundaries with a packet detector.
 2.
Initiate a search over the samples selected from step (1) to obtain fine frame boundaries. This step with step (1) are called Frame boundary determination (FBD).
 3.
With the first sample in the frame known, estimate and correct fractional CFO.
 4.
After FFT, estimate the ICFO and update the value of the frequency of offset estimated in step (3).
 5.
For OFDM cellular standards, estimate the CELLID from the given reference sequence.
 6.
Track and update the residual CFO (RCFO) and SCFO with the nonpreamble OFDM symbols.
For testing and evaluation purposes of the proposed architecture, high level Matlab floating/fixed point models of IEEE802.16e and LTE Rel. 8 have been created to apply the chosen algorithms in [8, 13–15]. The proposed embedded ASIP can be configured to implement the chosen synchronization algorithms. It is capable of supporting not only the chosen algorithms but it can support other algorithms as well.
4.1 Packet detection
Detection of the correct boundaries of the incoming OFDM packets has a major impact on the performance of all post FFT subsystems. Therefore, good timing synchronization algorithms in the acquisition stage will allow early locking on the incoming signal. Using correlationbased algorithms as a detection method directly as in [16] and [17] will cost more energy in the ideal state (no transmitted signals). On the other hand, using a simple, but not accurate detection method, like single or double sliding window (DSW) algorithms [18] to detect the packet and then refine the estimate using correlations will cost us lower energy.
where a_{ n }, b_{ n }, and M, L are the energies and sizes of window A and window B, respectively.
4.2 Symbol timing
The boundaries of the search window come from the packet detector. To prevent intersymbol interference, a reasonable shift inside the cyclic prefix is done. Hence, if the detector gives an estimate for the first sample in the symbol at m, the first correlation window begins at ms, where s denotes the safety shift back. The correlation window slides over time till a search size of 2s + 1 are evaluated. This implies that the maximum absolute value between 2s + 1 output correlation result corresponds to the maximum likelihood starting sample. This algorithm has proven its robustness against multipath fading channels, besides the advantage of being unaffected by the received power level. Only the number of samples L contributing in the cyclic prefix (CP) correlation is affecting the performance of the estimator.
The proposed embedded engine is capable of supporting a maximum correlation length of 256 samples. This maximum is chosen with respect to the required accuracy. This maximum is justified by the comparison between floating point results and the proposed engine results in Section 6. The maximum length of the contributing samples is 256 and can be scaled easily with respect to the required performance. The most complex operation here is the complex multiplication in Equation (8). Autocorrelation and Euclidean distance (ED) calculation (energy) in Equation (8) can be realized by the standard complex multiplyaccumulate (CMAC) unit shown in Figure 7a Subtraction of the output of ED from the output of CP autocorrelation is performed online by controlling the accumulation sign. Adders in accumulators are two's complement Add/Sub modules controlled by the processor control unit. The last step is the maximum absolute search between the 2s + 1 evaluated result with the ML unit shown in Figure 7b
4.3 Fractional CFO estimation
where T is the estimated starting sample index from FBD, θ is the starting point of the correlation window and L is the cyclic prefix length. The reason for not starting the correlation window from the beginning of the cyclic prefix is the multipath fading channel delay spread τ effect on the estimation performance. The value of θ is chosen such that θ > τ, so that the channel effect is the same in the two parts of the correlation and the output is affected only by the added white noise.
Equation (11) can simply be executed on the same complex multiplyaccumulate (CMAC) unit used for FBD in Figure 7a Memory requirements here depend on the selected correlation window length with a maximum of 256 samples as stated before. The only difference is the calculation of the correlation angle before multiplying it by a constant. Angle estimation is carried out using the iterative CORDIC algorithm [12]. More details about the implementation of this algorithm were described in Section 3.
4.4 Joint ICFO & CS estimation

Q (k  1) : is the first nonzero subcarrier before k.

P_{ j } (k) : is the stored reference sequence of index (j).

$\phantom{\rule{2.77695pt}{0ex}}{D}_{j}\left(k\right)={P}_{j}\left(k\right).{P}_{j}^{*}\left(k1\right)$: is the autocorrelation result of the stored sequences P_{ j }.

I : is the integral frequency offset normalized to the subcarrier spacing.
For correlation purposes, shifted versions of P_{ j } must also be stored. For example, if we have a maximum ICFO of I_{ m }, [I_{ m }, I_{ m }], we must store 2I_{ m } + 1 version from each correlation sequence P_{ j }.
The complexity of this method is acceptable and can be implemented easily on the proposed embedded ASIP, noting that D_{ j } (k + I) in Equation (13) does not have to be calculated on the fly, but can be calculated in advance and stored in the receiver memory. The calculation of ℜ{Q (k). Q^{ * } (k  1)} for k = 0, 1, ..., N_{ p }  1 in Equation (13) requires an N_{ p } complex multiplications. The binary nature of D_{ j } (k + i) in Equation (13) makes the remaining computations needed to obtain ${M}_{I}^{I,j}$ Binary correlations can be performed using the controlled accumulation (CACC) unit shown in Figure 7c Accumulation sign is controlled via stored (shifted and normalized) reference sequence. The constraint on the defined maximum possible shift I_{ m } comes from the symbol duration and the available cycle budget. For example, in IEEE 802.16e, for a maximum ICFO of nine subcarriers (in the range [9,9]) we need to evaluate 722 [20] different correlation outputs to choose the maximum absolute value as the correct estimate. Comparison between the evaluated correlation results is done on the fly after every new correlation output. Further optimizations of this processing type is done in the design of processor computational core. The engine is optimized not only for this algorithm, but for many other algorithms as well.
4.5 Joint RCFO & SCFO estimation
To obtain Equation (15), some terms were neglected. In reality these terms are not neglected and will cause ICI that is measured and stated in the engine results in Section 6.
The complexity of this algorithm on the proposed embedded engine is dominated by the vector complex conjugate multiplication of the received pilot pattern every symbol and the measurement of pilots angles in Equation (16) using the CORDIC Algorithm [12]. These measured angles are accumulated in two manners: normal accumulation, real multiply accumulate in Figure 7d The memory used to same successive pilot patterns plus the resulting vectors after the conjugate multiplication is relatively large and considered well in the design of the memory system.
5 Algorithm programming on the engine
5.1 Packet detection
The engine was programmed to run the packet detection algorithm, where a new sample read operation is issued every t_{ s } (one sample duration). In order to implement the packet detection algorithm with the time requirements of 802.16e and LTE release 8, a down sampling of the received signal by a factor of 5 is required. Simulation results showed that there is no significant performance loss due to the down sampling needed for the execution of this task on the engine core. The whole packet detection program executes in 31 cycles and repeated with the next new sample. Once a packet is detected a detection control signal will rise to begin the symbol timing procedure.
5.2 Symbol timing
The symbol timing algorithm begins with a read operation of the two parts of the cyclic prefix from External Port 1. In the read operation, we identify the destination memory bank and the length of the read vector (Ex: READ P 1 : DM 1,276). The core unit is configured to run the autocorrelation operation through a CMAC instruction with a conjugate flag set on the second input. The output is stored in a register from Register File 2 till the energy (Euclidean distance EDACC) contained in the first and the second part of the cyclic prefix are evaluated. A 24bit subtraction operation (LADD) between the stored result and the resulting energy will give the first correlation output. Till now this is considered the only and the largest result, so its index is stored in a register from Register File 1.
Beginning the correlation in Equation 8 from scratch every time is not practical and consumes more energy. The next correlation output can be extracted directly from the estimated correlation output by setting the memory address step by 1 and repeat what is done for the evaluation of the first correlation output. The correlation length this time is not the whole cyclic prefix length but only a single data point. This iterative approach reduces the required execution energy as well as the program length. Once we get a new correlation result, comparison between the stored largest correlation result and the new result is executed. If the new result is larger, an update is done for both the stored value and the index of the largest. A time locking control signal is flagged from the engine when the whole 2 * s + 1 results are evaluated. The stored index, which corresponds to the largest correlation result, is the correct start location. The output Starting Address is updated via the output updater by a SET instruction with the stored value in Register File 1.
5.3 Fractional carrier frequency offset estimation
At this stage, the cyclic prefix is still stored in DM1 and DM2. So, no read operation is issued and the correlation begins directly with the known estimated index from FBD. The correlation output then passes by the CORDIC algorithm using the ANGLE instruction to estimate the output phase. The output of the ANGLE instruction is stored in a register from Register File 1. According to Equation (11), FCFO is estimated from the output phase by a constant multiplication by $\frac{1}{2\pi N{t}_{s}}$. The update on the output FCFO register is carried out via the output updater with the same SET instruction.
5.4 Cellsearch & integral carrier frequency offset
With the existence of an ICFO, the number of correlations needed to identify the transmitter (CELLID) in modern cellular networks can be very large due to the large number of reference sequences associated with each standard (114 for 802.16e, 504 for LTE release 8). A special instruction, called BPCACC, is used for the evaluation of a correlation with a binary sequence. The BPCACC instruction is capable of evaluating a binary correlation of length N_{ c } in $\frac{{N}_{c}}{4}+1$ clock cycles. The core unit should have 4 new complex numbers every cycle in the execution of the BPCACC instruction.
The symbol number of the received reference symbol is known at the receiver. Separation of this symbol is done after the FFT, as well as the removal of the guard bands. A read operation from External Port 2 is issued to store the received reference symbol in DM1. To read four successive complex samples from the received reference symbol, a copy of the received reference symbol is stored in DM2. Memory step registers are set to four, so that each port from the four ports of the two memory banks will provide the engine by a different complex data sample every cycle. This allows the memory system to output four consecutive complex words each cycle. We use DM1 to get the differential signal in, which is of length N_{ c }, and store it in DM2. Then, a copy of the contents of DM2 is moved again to DM1. The core unit is configured to perform the binary correlation by adding four complex numbers together with the ADD/SUB signals controlled by the correlation sequence. Every combination of 4bits from the correlation sequence correspond to a combination of 8 ADD/SUB signals to control the operation of the adders A 1 to A 8.
Every new correlation result is compared with the maximum previous result and the index of the maximum correlation output is stored in a register from Register File 1. This index corresponds to the correct ICFO and the attached CELLID. The final step in the aquistion phase is updating the values of ICFO and CELLID output registers.
5.5 Joint RCFO & SCFO
6 Performance evaluation
The performance of the proposed engine is measured against floating point Matlab model to assess its accuracy and the effect of roundoff errors. The implementation efficiency is projected for the resulting core area and power consumption to implement the supported operations.
Bitaccurate fixed point simulations are used for functional verification, the generation of test vectors and building verification suites. The word length chosen for the execution unit is determined by the maximum precision needed in any algorithm. FCFO estimation needs 24bit word, which turned out to be the largest number needed. To verify the accuracy of the chosen number of bits, comparison between floating point (FP) results and engine results for FCFO estimation (normalized by subcarrier spacing) is shown in Figure 11 with a correlation length of 96 and 60 in WiMAX and LTE, respectively.
The Altera Stratix III FPGA kit is used to functionally verify the proposed design while Synopsys Design Compiler is used to estimate the chip area and the design static power consumption (ASIC design). The engine is coded using Verilog HDL, which is compatible with Synopsys Design Compiler. MentorGraphics Modelsim was used for functional simulations.
First the design was synthesized with the Altera Quartus II and programmed on a Stratix III (Stratix III EP3SC150 FPGA kit) FPGA to verify the design functionality. Then, The processor was synthesized in a 0.18μ m CMOS process at a voltage of 1.8 V using Synopsys Design Compiler. The engine, without the memory, is estimated to occupy 1.1mm^{2} and is estimated to consume an average static power of 7.9 mW when running at a speed of 120 MHz.
The proposed engine's control overhead is less than 10% of the total processing cycles. Pipelined processing of data is interrupted mainly by CORDIC subroutine in an average of 4820 cycles/symbol. A total of 95 MIPS are supported @ 120 MHz.
 1.
The use of optimized instruction set. This removed many control overheads and allowed faster executions.
 2.
Memory architecture that reduces memory interactions, even with complex vector instructions.
 3.
The mechanism of data movement to/from the CU and the memories.
 4.
Grouping of all units and increasing the degree of hardware reuse.
 5.
No cache memory is used.
Resulting cycle budget for different tasks
Processing Task  # Cycles (WiMAX)  Latency @ 120 MHz (μs)  # Cycles (LTE)  Latency @ 120 MHz (μs) 

FBD  894/frame  7.45  606/frame  5.05 
FCFO  214/frame  1.78  142/frame  1.16 
CS # ICFO  10830/frame  90.25  3520/frame  29.33 
RCFO  5288/symbol  44.117  4408/symbol  36.73 
SCFO  5284/symbol  44.08  4404/symbol  36.70 
Parameters in IEEE 802.16e, LTE Rel. 8 (PSS, SSS are primary and secondary synchronizationsymbols)
parameter  IEEE 802.16e  LTE Rel. 8 

Useful symbol time (μs)  91.428  66.667 
Subcarrier spacing (KHz)  10.9  15 
Max. CP length (samples)  256 (22.85 μs)  160 (10.41 μs) 
Max. number of pilots  240  200 
Max. preamble length  2048  72 (PSS), 72 (SSS) 
Comparison between dedicated implementations and engine results in WiMAX
Process  # Gates (k)  # Mul.  # Cycles  Max. Freq. (MHz) 

FBD  49  1  21  147 
FCFO  18  1  142  161 
CS # ICFO  251  4  5112  107 
RCFO  54  2  2644  145 
SCFO  54  2  2640  145 
Proposed  118  3  Table 4  159 
Synthesis results
FPGA type  Altera Stratix III EP3SC150 

Total ALUT  752/113,600 (< 1%) 
DSP blocks (18bit)  3 
Dedicated logic registers  262/113,600 (< 1%) 
Block RAM  54.4844/5499 Kbit (< 1%) 
Max. clock frequency  159MHz 
We programmed the processor using one complete program that is executed at the beginning of every new frame. Otherwise, only the packet detection part of the code is executed till the beginning of an incoming frame. Tracking algorithms have proven to be the most power consuming as it is executed every symbol, unlike the acquisition algorithms that executes only at the beginning of reception.
To fully utilize the capabilities of the proposed engine, it is recommended to maximize the utilization of operations that have a fast execution phase in the instruction set. This requires understanding the proposed architecture with the attached instruction set when developing the algorithms. The assembly programmer should take care of various parameters that make the execution faster. For example, if pilot subcarriers are arranged at the same indices over time, there is no need to use special arrangements like the one shown in Figure 10 Normal arrangement of pilots beside each other will reduce the complexity when executing the rest of RCFO & SCFO estimation algorithm.
The engine is verified by running all the synchronization subsystem tasks in 802.16e and LTE release 8. It meets the timing requirements @ 120 MHz, while the maximum operating frequency is 149 MHz. To support larger systems in a scalable way, more than one CU can be inserted and controlled with the same control vector. In this case, accumulation outputs O 3, O 4 in Figure 4 are fed directly to another unit through the two input ports I 9, I 10.
7 Conclusion
In this article, a scalable embedded reconfigurable baseband ASIP for OFDM synchronization subsystem has been proposed. The processor can support a multitude of OFDMbased standards. Although the engine is optimized for OFDM synchronization purposes through detailed analysis of synchronization tasks in the different OFDMbased standards, it also offers a high degree of flexibility to support other simple and vector operations. Area, power and hardware complexity are reduced through reconfiguration of a single unit to support multiple special operations optimized for synchronization subsystem. The processor was successfully tested on IEEE 802.16e and 3GPP LTE Rel. 8 standards. Synthesis results show that it is efficient in terms of throughput, area and power consumption.
Declarations
Acknowledgements
The authors would like to thank H.A.H. Fahmy, K. Osama, and H. Hamed for their invaluable comments while preparing this article.
Authors’ Affiliations
References
 Glossner J, Iancu D, Jin L, Hokenek E, Moudgill M: A softwaredefined communications baseband design. IEEE Commun Mag 2003, 41: 120128.View ArticleGoogle Scholar
 Vogt N, Wehn T: A Reconfigurable ASIP for convolutional and turbo decoding in an SDR environment. IEEE Trans Very Large Scale Integrat (VLSI) Syst 2008, 16: 13091320.View ArticleGoogle Scholar
 Abdelall M, Shalash AF, Fahmy HAH: A reconfigurable baseband processor for wireless OFDM synchronization subsystem. IEEE Int Symp Circ Syst 2011.Google Scholar
 Azar C, Ojail M, Chevobbe S, David R: CERA: a channel estimation reconfigurable architecture. IEEE Int Conf Telecommun ICT 2010, 17: 957964.Google Scholar
 ETSI Broadband radio access networks (BRAN), Hiperlan type2; physical (PHY) layer, ETSI BRAN, Technical Report. 2000, 101: 475.Google Scholar
 IEEE802.11, Wireless LAN Medium Access Control(MAC) and Physical Layer (PHY) specification: HighSpeed Physical Layer in the 5GHz Band, IEEE Std 802.11a1999, IEEE Computer Society. 2000.Google Scholar
 IEEE, (IEEE) Standard for Local and metropolitan area networks, Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems, IEEE 802.16e2005 and IEEE 802.162004/Corl2005. 2006.Google Scholar
 Speth M, Fechtel SA, Fock G, Meyr H: Optimum receiver design for wireless broadband systems using OFDM: Part I. IEEE Trans Commun 1999, 47: 16681677. 10.1109/26.803501View ArticleGoogle Scholar
 Mei B, Lambrechts A, Verkest D: Architecture exploration for a reconfigurable architecture template. IEEE Des Test Comput 2005, 22: 90101. 10.1109/MDT.2005.27View ArticleGoogle Scholar
 Ebeling C, Fisher C, Xing G, Shen M, Liu H: Implementing an OFDM receiver on the RaPiD reconfigurable architecture. IEEE Trans Signal Process 2004, 53: 14361448.Google Scholar
 Poon ASY: An energyefficient reconfigurable baseband processor for wireless communications. IEEE Trans VLSI 2007, 15: 319327.View ArticleGoogle Scholar
 Dawid H, Meyr H: CORDIC algorithms and architectures. Digital Signal Process Multimedia Syst 1999, 2: 623655.Google Scholar
 Wu JM, Chou CH: Baseband Sampling Clock Frequency Synchronization for WiMAX Systems, Institute of Communications Engineering National Tsing Hua University Hsinchu. 2005.Google Scholar
 van de Beek JJ, Sandell M, Borjesson PO: ML Estimation of Time and Frequency Offset in OFDM Systems. IEEE Trans Signal Process 1997, 45: 18001805. 10.1109/78.599949MATHView ArticleGoogle Scholar
 Speth M, Fechtel SA, Fock G, Meyr H: Optimum receiver design for OFDMbased broadband transmission. II A case study IEEE Trans Commun 2001, 49: 571578.View ArticleGoogle Scholar
 Bhatt T, Sundaramurthy V, Zhang JC, McCain D: Initial Synchronization for 802.16e Downlink,Signals, Asilomar Conference on Systems and Computers ACSSC. 2006, 40: 701707.Google Scholar
 Tang H, Lau KY, Brodersen RW: Synchronization Schemes for Packet OFDM System. IEEE Int Conf Commun ICC 2003, 5: 33463350.Google Scholar
 Heiskala H, Terry JT: OFDM wireless LANs: A Theoretical and Practical Guide. Sams Publishing, Indianapolis 2002.Google Scholar
 Hung KC, Lin DW: Joint detection of integral carrier frequency offset and preamble index in OFDMA WiMAX downlink synchronization. IEEE Wireless Communications and Networking Conference (WCNC) 2007, 19591964.Google Scholar
 Lin YC, Su SL, Wang HC: A low complexity cell search method for IEEE 802.16e OFDMA systems. International Conference on Advanced Communication Technology (ICACT) 2009, 2: 980984.Google Scholar
 Harju L, Nurmi J: A synchronization coprocessor architecture for WCDMA/OFDM mobile terminal implementations. International Symposium on SystemonChip 2005, 141145.Google Scholar
 Troya A, Maharatna K, Krstic M, Grass E: Lowpower VLSI implementation of the inner receiver for OFDMBased WLAN systems. IEEE Trans Circ Syst 2008, 55: 672686.MathSciNetView ArticleGoogle Scholar
 Li X, Zheng Y, Lai Z: A low complexity sign ML detector for symbol and frequency synchronization of OFDM systems. IEEE Trans Consumer Electron 2006, 52: 317320. 10.1109/TCE.2006.1649644View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.