- Research Article
- Open Access

# Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes

*EURASIP Journal on Embedded Systems* **volume 2009**, Article number: 723465 (2009)

## Abstract

The layered decoding algorithm has recently been proposed as an efficient means for the decoding of low-density parity-check (LDPC) codes, thanks to the remarkable improvement in the convergence speed (2x) of the decoding process. However, pipelined semi-parallel decoders suffer from violations or "hazards" between consecutive updates, which not only violate the layered principle but also enforce the loops in the code, thus spoiling the error correction performance. This paper describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of "idle" cycles, to prevent the hazards of the pipeline mechanism. Also, different semi-parallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Then, taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.

## 1. Introduction

Improving the reliability of data transmission over noisy channels is the key issue of modern communication systems and particularly of wireless systems, whose spatial coverage and data rate are increasing steadily.

In this context, low-density parity-check (LDPC) codes have gained momentum in the scientific community, and they have recently been adopted as forward error correction (FEC) codes by several communication standards, such as second-generation digital video broadcasting (DVB-S2, [1]), wireless metropolitan area networks (WMANs, IEEE 802.16e, [2]), wireless local area networks (WLANs, IEEE 802.11n, [3]), and 10 Gbit Ethernet (10GBASE-T, IEEE 802.3an).

LDPC codes were first discovered by Gallager in the early 1960s [4] but were long put aside until MacKay and Neal, sustained by the advances in very large-scale integration (VLSI) technology, rediscovered them in the mid-1990s [5]. The renewed interest in LDPC codes and their success are due to (i) the remarkable error-correction performance, even at low signal-to-noise ratios (SNRs) and for small block-lengths, (ii) the flexibility in the design of the code parameters, (iii) the decoding algorithm, which is very suitable for hardware parallelization, and last but not least (iv) the advent of structured or architecture-*aware* (AA) codes [6]. AA-LDPC codes reduce the decoder area and power consumption and improve the scalability of its architecture, thus allowing the full exploitation of the complexity/throughput design trade-offs. Furthermore, AA-codes perform so close to random codes [6] that they are the common choice of all the latest LDPC-based standards.

Nowadays, data services and user applications impose severe low-complexity and low-power constraints on the design of practical decoders while demanding very high throughput. A fully parallel decoder architecture delivers impressive throughput but is so complex in terms of both area and routing [7] that a semi-parallel implementation is usually preferred (see [6, 8]).

To counteract the reduced throughput, designers can act at two levels: at the algorithmic level, by efficiently rescheduling the message-passing algorithm to improve its convergence rate, and at the architectural level, by pipelining the decoding process to shorten the iteration time. The first can be addressed with the *turbo-decoding message-passing* (TDMP) [6] or the *layered* decoding algorithm [9], while pipelined architectures are mandatory especially when the decoder employs serial processing units.

However, the pipeline mechanism may dramatically corrupt the error-correction performance of a layered decoder, because the processing units do not always work on the most up-to-date messages. This issue, known as a pipeline "hazard", arises when the dependence between the elaborations is violated. The idea is then to reschedule the sequence of updates and to delay the decoding process with "idle" cycles until newer data are available.

As an improvement to similar state-of-the-art works [10–13], this paper proposes three systematic techniques to optimally reschedule the decoding process in a way to minimize the number of idle cycles and achieve the maximum throughput. Also, this paper discusses different semi-parallel architectures, based on serial processing units and all supporting the reordering strategies, so as to attain the best trade-off between complexity and throughput for every LDPC code.

Semi-parallel architectures of LDPC decoders have recently been addressed in several papers, although none of them formally solves the issue of pipeline hazards and decoding idling. Gunnam et al. describe in [10] a pipelined semi-parallel decoder for WLAN LDPC codes, but the authors do not mention the issue of pipeline hazards; they only describe the need to properly scramble the sequence of data in order to clear some memory conflicts.

Boutillon et al. consider in [13] methods and architectures for layered decoding; the authors mention the problem of pipeline hazards (cut-edge conflict) and of using an output order different from the natural one in the processing units; nonetheless, the issue is not investigated further, and they simply modify the decoding algorithm to compute partial updates as in [14]. Although this approach allows the decoder to operate in full pipeline with no idle cycles, it is actually suboptimal in terms of both performance and complexity.

Similarly, Bhatt et al. propose in [11] a pipelined block-serial decoder architecture based on partial updates, but again, they do not investigate the dependence between elaborations.

In [12], Fewer et al. implement a semi-parallel TDMP decoder, but the authors boost the throughput by decoding two codewords in parallel and not by means of pipeline.

This paper is organised as follows. Section 2 recalls the basics of LDPC and of AA-LDPC codes and Section 3 summarizes the layered decoding algorithm. Section 4 introduces three different techniques to reduce the dependence between consecutive updates and analytically derives the related number of idle cycles. After this, Section 5 describes the VLSI architectures of a pipelined block-serial LDPC-layered decoder. Section 6 briefly reviews the WLAN codes used as a case study, while the performances of the related decoder are analysed in Section 7. Then, the results of the logic synthesis on a 65 nm low-power CMOS technology are discussed in Section 8, along with the comparison with similar state-of-the-art implementations. Finally, conclusions are drawn in Section 9.

## 2. Architecture-Aware Block-LDPC Codes

LDPC codes are linear block-codes described by a parity-check matrix establishing a certain number of (even) parity constraints on the bits of a codeword. Figure 1 shows the parity-check matrix of a very simple LDPC code with length bits and with parity constraints. LDPC codes are also effectively described in a graphical way through a Tanner graph [15], where each bit in the codeword is represented with a circle, known as variable-node (VN), and each parity-check constraint with a square, known as check-node (CN).

Recently, the joint design of code and decoder has blossomed in many works (see [8, 16]), and several principles have been established for the design of implementation-oriented AA-codes [6]. These can be summarized into (i) the arrangement of the parity-check matrix in squared subblocks, and (ii) the use of deterministic patterns within the subblocks. Accordingly, AA-LDPC codes are also referred to as *block*-LDPC codes [8].

The pattern used within blocks is the vital facet for a low-cost implementation of the interconnection network of the decoder and can be based either on permutations, as in [6] and for the class of π-rotation codes [17], or on *circulants* or cyclic shifts of the identity matrix, as in [8] and in all recent standards [1–3].

AA-LDPC codes are defined by the number of *block*-columns , the number of *block*-rows , and the block-size , which is the size of the component submatrices. Their parity-check matrix can be conveniently viewed as , that is, as the expansion of a base-matrix with size . The expansion is accomplished by replacing the 1's in with permutations or circulants, and the 0's with null subblocks. Thus, the block-size is also referred to as *expansion*-factor, for a codeword length of the resulting LDPC code equal to and code rate .

A simple example of expansion or *vectorization* of a base-matrix is shown in Figure 1. The size, number, and location of the nonnull blocks in the code are the key parameters to get good error-correction performance and low-complexity of the related decoder.
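The expansion of a base-matrix into circulant subblocks can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `expand_base_matrix`, the use of a negative entry to mark a null subblock, and the right-shift direction of the circulants are conventions chosen here, not taken from the paper or from any standard.

```python
def expand_base_matrix(shifts, block_size):
    """Expand a matrix of circulant shifts into a binary parity-check matrix.

    Assumed convention: shifts[i][j] < 0 marks a null subblock, while a value
    s >= 0 marks the identity matrix cyclically shifted to the right by s.
    """
    rows, cols = len(shifts), len(shifts[0])
    H = [[0] * (cols * block_size) for _ in range(rows * block_size)]
    for i in range(rows):
        for j in range(cols):
            s = shifts[i][j]
            if s < 0:
                continue  # null subblock: leave zeros
            for r in range(block_size):
                c = (r + s) % block_size  # row r of the shifted identity
                H[i * block_size + r][j * block_size + c] = 1
    return H


# Toy 2 x 4 base-matrix with block-size 3 (hypothetical values).
toy_shifts = [[0, 1, -1, 2],
              [2, -1, 0, 1]]
H = expand_base_matrix(toy_shifts, 3)
```

The resulting matrix has 6 rows and 12 columns, and every row has weight 3 because each block-row of the toy base-matrix contains three nonnull subblocks.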

## 3. Decoding of LDPC Codes

LDPC codes are decoded with the *belief propagation* (BP) or *message-passing* (MP) algorithm, which belongs to the broader class of maximum *a posteriori* (MAP) algorithms. The BP algorithm has been proved to be optimal if the graph of the code does not contain cycles, but it can still be used and considered as a reference for practical codes with cycles. In the latter case, the sequence of the elaborations, also referred to as the *schedule*, considerably affects the achievable performance.

The most common schedule for BP is the so-called two-phase or flooding schedule (FS) [18], where all parity-check nodes are updated first, followed by all variable nodes.

A different approach, taking the distribution of closed paths and girths in the code into account, has been described by Xiao and Banihashemi in [19]. Although *probabilistic* schedules are shown to outperform *deterministic* schedules, the random activation strategy of the processing nodes is not well suited to hardware implementation and adds significant complexity overheads.

The most attractive schedule is the *shuffled* or *layered* decoding [6, 9, 18, 20]. Compared to the FS, the layered schedule almost doubles the decoding convergence speed, both for codes with cycles and for cycle-free codes [20]. This is achieved by looking at the code as a connection of smaller supercodes [6] or *layers* [9], exchanging intermediate reliability messages. Specifically, *a posteriori* messages are made available to the next layers immediately after computation and not at the next iteration as in a conventional flooding schedule.

Layers can be any set of either CNs or VNs, and, accordingly, CN-*centric* (or horizontal) or VN-*centric* (or vertical) algorithms have been analyzed in [18, 20]. However, CN-*centric* solutions are preferable since they can exploit serial, flexible, and low-complexity CN processors.

The horizontal layered decoding (HLD) is summarized in Algorithm 1 and consists in the exchange of probabilistic reliability messages around the edges of the Tanner graph (see Figure 1) in the form of logarithms of likelihood ratios (LLRs); given the random variable , its LLR is defined as

**Algorithm 1:** Horizontal layered decoding.

**input**: a-priori LLR ,

**output**: a posteriori hard decisions

(1) // *Messages initialization*

(2) , , , , ;

(3) **while** ( & !Convergence) **do**

(4) // *Loop on all layers*

(5) **for** to **do**

(6) // *Check-node update*

(7) **forall** **do**

(8) // *Sign update*

(9) ;

(10) // *Magnitude update*

(11) ;

(12) // *Soft-output update*

(13)

(14) **end**

(15) **end**

(16) ;

(17) **end**

In Algorithm 1, is the th *a priori* LLR of the received bits, with and the length of the codeword, is the overall number of parity-check constraints, and the number of decoding iterations. Also, is the set of VNs connected to the th CN, represents the check-to-variable (*c2v*) reliability message sent from CN to VN at iteration , and is the total information or *soft-output* (SO) of the th bit in the codeword (see Figure 1).

For the sake of an easier notation, it is assumed here that a layer corresponds to a single row of the parity-check matrix. Before being used by the next CN or layer, SOs are refined with the involved *c2v* message, as shown in line 13, and thanks to this mechanism, faster convergence is achieved.
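The layered schedule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the decoder of this paper: each layer is a single parity-check row, the magnitude update uses the plain min-sum rule as a stand-in for the operator of [21], positive LLRs are assumed to favour bit 0, and all names (`layered_min_sum_decode`, `rows`, `so`, `c2v`) are hypothetical.

```python
def layered_min_sum_decode(rows, llr, max_iter=10):
    """Horizontal layered decoding with a min-sum check-node update.

    rows: list of lists, the variable-node indices of each check (degree >= 2).
    llr:  a priori LLRs, with positive values assumed to favour bit 0.
    Returns (hard_decisions, soft_outputs).
    """
    so = list(llr)                            # soft outputs, one per bit
    c2v = [[0.0] * len(r) for r in rows]      # one c2v message per edge
    hard = [1 if y < 0 else 0 for y in so]
    for _ in range(max_iter):
        for m, row in enumerate(rows):
            # v2c message: soft output minus the old c2v on the same edge
            v2c = [so[n] - c2v[m][k] for k, n in enumerate(row)]
            for k, n in enumerate(row):
                others = v2c[:k] + v2c[k + 1:]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                c2v[m][k] = sign * min(abs(v) for v in others)
                # refined soft output, immediately visible to later layers
                so[n] = v2c[k] + c2v[m][k]
        hard = [1 if y < 0 else 0 for y in so]
        if all(sum(hard[n] for n in row) % 2 == 0 for row in rows):
            break                             # all parity checks satisfied
    return hard, so


# Toy example: three checks on seven bits, all-zero codeword, bit 2 corrupted.
toy_rows = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
toy_llr = [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0]
decoded, soft = layered_min_sum_decode(toy_rows, toy_llr)
```

The key point is that `so[n]` is overwritten inside the loop over layers, so the second and third checks already see the value refined by the first one within the same iteration.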

Magnitudes are updated with the binary operator [21] defined as for . Following an approach similar to Jones et al. [22], the updating rule of magnitudes is further simplified with the method described in [23], which proved to yield very good performance. Here, only two values are computed and propagated for the magnitude of *c2v* messages; specifically, if we define

the index of the smallest variable-to-check (*v2c*) message entering CN , then a dedicated *c2v* message is computed in response to VN :

while all the remaining VNs receive one common, *nonmarginalized* value for magnitude given by
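The two-value representation of the *c2v* magnitudes can be sketched as follows. The sketch replaces the approximation of [23] with the plain min-sum rule, so that the common value is the smallest input magnitude and the dedicated value is the second smallest; this substitution and the names `two_value_magnitudes` and `expand_magnitudes` are assumptions made only for illustration.

```python
def two_value_magnitudes(v2c_mags):
    """Compress the c2v magnitudes of one check node into two values.

    Returns (idx_min, dedicated, common), where idx_min is the position of the
    smallest v2c magnitude, dedicated is the value sent back to that position,
    and common is the value sent to every other position.
    Min-sum is used here as a stand-in for the rule of the paper.
    """
    idx_min = min(range(len(v2c_mags)), key=lambda k: v2c_mags[k])
    common = v2c_mags[idx_min]
    dedicated = min(m for k, m in enumerate(v2c_mags) if k != idx_min)
    return idx_min, dedicated, common


def expand_magnitudes(idx_min, dedicated, common, order):
    """Regenerate the c2v magnitudes in an arbitrary output order."""
    return [dedicated if k == idx_min else common for k in order]


idx, ded, com = two_value_magnitudes([3.0, 0.5, 2.0, 1.0])
natural = expand_magnitudes(idx, ded, com, order=[0, 1, 2, 3])
reversed_out = expand_magnitudes(idx, ded, com, order=[3, 2, 1, 0])
```

Because only an index and two magnitudes are stored, the messages can be regenerated in any order by a simple selection between the two values, which is what makes different output orders cheap to support in a serial processor.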

## 4. Decoding Pipelining and Idling

The data-flow of a pipelined decoder with serial processing units is sketched in Figure 2. A centralized memory unit keeps the updated soft-outputs, computed by the node processors (NPs) according to Algorithm 1. If we denote with the number of nonnull blocks in layer , that is, the degree of layer , then the processor takes clock cycles to serially load its inputs. Then, refined values are written back in memory (after scrambling or permutation) with the latency of clock cycles, and this operation takes again clock cycles. Overall, the processing time of layer is then clock cycles, as shown in Figure 3(a).

If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out and reading, so that data are continuously read from and written into memory, and a new layer is processed every clock cycles (see Figure 3(b)).
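The gain of pipelining can be made concrete with a small cycle count. The sketch below assumes the timing described above: a layer of degree `d` needs `d` cycles to load, a fixed data-path latency, and `d` cycles to write back, while in pipeline a new layer starts every `d` cycles. The names `degrees` and `latency` and the numbers in the example are hypothetical, and idle cycles and the final pipeline drain are deliberately ignored.

```python
def iteration_cycles(degrees, latency, pipelined):
    """Clock cycles of one decoding iteration for a serial node processor.

    degrees:   number of nonnull blocks of each layer.
    latency:   data-path latency between last input and first output.
    pipelined: if True, reading, elaboration, and writing overlap.
    """
    if pipelined:
        # steady state: a new layer is started every d cycles
        return sum(degrees)
    # without pipeline each layer takes d (read) + latency + d (write)
    return sum(2 * d + latency for d in degrees)


toy_degrees = [7, 7, 8]
serial = iteration_cycles(toy_degrees, latency=5, pipelined=False)
overlapped = iteration_cycles(toy_degrees, latency=5, pipelined=True)
```

For this toy case the iteration shrinks from 59 to 22 cycles, provided that no hazard forces the insertion of idle cycles.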

Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory and used for the current elaboration might not always be up to date, as newer values could still be in the pipeline. This issue, known as a pipeline hazard, prevents the propagation of up-to-date messages and spoils the error-correction performance of the decoding algorithm.

The solution investigated in this paper is to insert *null* or *idle* cycles between consecutive updates, so that a node processor is suspended to wait for newer data. The number of idle cycles must be kept as small as possible since it affects the iteration time and so the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder as well as on the order followed to update messages within a layer.

Three different strategies are described in this section, to reduce the dependence between consecutive updates in the HLD algorithm and, accordingly, the number of idle cycles. These differ in the order followed for acquisition and writing-out of the decoding messages and constitute a powerful tool for the design of "layered", hazard-free, LDPC codes.
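The dependence of the idle cycles on the read and write orders can be checked numerically with a cycle-level sketch. It does not reproduce the closed-form expressions derived in the following subsections; it simply searches for the smallest delay that avoids a hazard between two consecutive layers, under a timing convention assumed here: the previous layer of degree `d` starts reading at cycle 0, its `k`-th output is written at cycle `d + latency + k`, and a soft-output may be read only strictly after it has been written. All names are hypothetical.

```python
def min_idle_cycles(prev_write_order, next_read_order, latency):
    """Smallest number of idle cycles that clears the hazards between two
    consecutive layers, given the order in which the previous layer writes its
    soft-outputs and the order in which the next layer reads its own.
    """
    degree = len(prev_write_order)
    # cycle at which each soft-output of the previous layer reaches memory
    write_time = {so: degree + latency + k
                  for k, so in enumerate(prev_write_order)}
    idle = 0
    for k, so in enumerate(next_read_order):
        if so in write_time:
            # the next layer reads position k at cycle degree + idle + k
            idle = max(idle, write_time[so] + 1 - (degree + k))
    return idle


# Soft-outputs are identified by block-column indices (toy values).
early_reuse = min_idle_cycles([0, 1, 2, 3], [1, 8, 9, 0], latency=3)
late_reuse = min_idle_cycles([0, 1, 2, 3], [8, 9, 0, 1], latency=3)
no_overlap = min_idle_cycles([0, 1, 2, 3], [8, 9, 10], latency=3)
```

In this toy case, reading the shared soft-outputs last reduces the idle cycles from 5 to 2, and two layers with no soft-output in common need none.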

### 4.1. System Notation

Without loss of generality, let us identify a layer with one single parity-check node and, focusing on the set of soft-outputs participating in layer , let us define the following subsets:

(i), the set of SOs in common with layer ;

(ii), the set of SOs in common with layer and not in ;

(iii), the set of SOs in common with both layers and ;

(iv), the set of SOs in common with layer and not in or ;

(v), the set of SOs in common with layer but not in , , ;

(vi), the set of SOs in common with both layers and , but not in or ;

(vii), the set of remaining SOs.

In the definitions above the notation means the relative complement of in or the set-theoretic difference of and . Let us also define the following cardinalities: (degree of layer ), , , , , , , .

### 4.2. Equal Output Processing

First, let us consider a very straightforward and implementation friendly architecture of the node processor that updates (and so delivers) the soft-output messages with the same order used to take them in.

In such a case it would be desirable to (i) postpone the acquisition of messages updated by the previous layer, that is, messages in , and (ii) output the messages in as soon as possible to let the next layer start earlier. Actually, the last constraint only holds when does not include any message common to layer , that is, when ; otherwise, the set could be acquired at any time before .

Figure 4 shows the I/O data streams of an equal output processing (EOP) unit. Here, is the latency of the SO data-path, including the elaboration in the NP, the scrambling, and the two memory accesses (reading and writing). Focusing on layer , the set cannot be assigned to any specific position within , since the whole is acquired according to the same order used by layer to output (and so also acquire) the sets and . For this reason, the situation plotted in Figure 4 is only for the sake of a clearer drawing.

With reference to Figure 4, pipeline hazards are cleared if idle cycles are spent between layer and so that

with for and otherwise. This means that if is empty, then the messages in do not need to be waited for. The solution to (5) with minimum latency is

Note that (5) and (6) only hold under the hypothesis of leading within . If this is not the case, up to extra idle cycles could be added if is output last within .

So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layer and . Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers and and between layers and .

### 4.3. Reversed Output Processing

Depending on the particular structure of the parity-check matrix , it may occur that most of the messages of layer in common with layer are also shared with layer , that is, and . If this condition holds, as for the WLAN LDPC codes (see Figure 11), it can be worth reversing the output order of SOs so that the messages in can be both acquired last and output first.

Figure 5(a) shows the I/O streams of a reversed output processing (ROP) unit. Exploiting the reversal mechanism, the set is acquired second-last, just before , so that it is available earlier for layer .

Following a reasoning similar to EOP, the situation sketched in Figure 5(a) where is delivered first within is just for an easier representation, and the condition for hazard-free layered decoding is now

Indeed, when , one could output first in , and so get rid of the term . However, since is actually left floating within , (7) represents again a best-case scenario, and up to extra idle cycles could be required. From (7), the minimum latency solution is

Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets , , and are managed similarly to , and . The ROP strategy is then instructed to acquire the set later and to output earlier. However, the situation is complicated by the fact that the set may not entirely coincide with ; rather it is , since some of the messages in can be found in . This is highlighted in Figure 5(b), where those messages of and not delivered to are shown in dark grey.

To clear the hazards between three layers, additional idle cycles are added in the number of

where is the acquisition margin on layer , and is the writing-out margin on layer . These can be computed under the assumption of no hazard between layer and (i.e., is aligned with thanks to as shown in Figure 5(b)) and are given by

The margin is actually nonnull only if ; otherwise, under the hypothesis that (i) the set is output first within , and (ii) within , the messages not in are output last.

Overall, the number of idle cycles of ROP is given by

### 4.4. Unconstrained Output Processing

Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer can still delay the acquisition of the messages updated by layer (i.e., messages in ) as usual, but at the same time the messages common to layer (i.e., in ) can also be delivered earlier.

The input and output data streams of an unconstrained output processing (UOP) unit are shown in Figure 6. Now, hazard-free layered decoding is achieved when

which yields

Regarding the interaction between three consecutive layers, if the messages common to layer (i.e., in ) are output just after , and if on layer , the set is taken just before , then there is no risk of pipeline hazard between layer and .
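The qualitative difference between EOP, ROP, and UOP can be illustrated on a toy example. The sketch below is not the analysis of this section: it uses one simple ordering heuristic per strategy, considers only the interaction between two consecutive layers, and relies on an assumed timing convention (the `k`-th output of a layer of degree `d` is written at cycle `d + latency + k`, and a value can be read only strictly after it is written). Function names, layer contents, and the latency are hypothetical.

```python
def strategy_orders(strategy, layer, prev, nxt):
    """Return (read_order, write_order) of one layer for a given strategy."""
    from_prev = [s for s in layer if s in prev]          # updated by prev layer
    next_only = [s for s in layer if s in nxt and s not in prev]
    rest = [s for s in layer if s not in prev and s not in nxt]
    if strategy == "EOP":
        # same order in and out: next-layer data first, prev-layer data last
        read = next_only + rest + from_prev
        return read, list(read)
    read = rest + next_only + from_prev
    if strategy == "ROP":
        # output order is the reversed input order
        return read, read[::-1]
    if strategy == "UOP":
        # output order is free: data needed by the next layer go first
        to_next = [s for s in layer if s in nxt]
        return read, to_next + [s for s in layer if s not in nxt]
    raise ValueError(strategy)


def idle_cycles(write_order, next_read_order, latency):
    """Smallest hazard-free delay between two consecutive layers."""
    degree = len(write_order)
    write_time = {s: degree + latency + k for k, s in enumerate(write_order)}
    idle = 0
    for k, s in enumerate(next_read_order):
        if s in write_time:
            idle = max(idle, write_time[s] + 1 - (degree + k))
    return idle


# Three toy layers, given as lists of block-column indices.
layers = [[0, 1, 2, 3], [2, 3, 4, 5], [3, 5, 6, 7]]
results = {}
for strategy in ("EOP", "ROP", "UOP"):
    _, write1 = strategy_orders(strategy, layers[1], layers[0], layers[2])
    read2, _ = strategy_orders(strategy, layers[2], layers[1], layers[0])
    results[strategy] = idle_cycles(write1, read2, latency=2)
```

On this example the idle cycles between the second and the third layer are 4 for EOP, 2 for ROP, and 1 for UOP, which matches the expectation that decoupling the output order from the input order leaves fewer dependencies to wait for.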

### 4.5. Decoding of Irregular Codes

A serial processor cannot process consecutive layers with decreasing degrees, , as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is nothing but another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with idle cycles.

Since this type of hazard is independent of that seen above, the same idle cycles may help to solve both issues. For this reason, the overall number of idle cycles becomes

with being computed according to (6), (11), or (13).

### 4.6. Optimal Sequence of Layers

For a given reordering strategy, the overall number of idle cycles per decoding iteration is a function of the actual sequence of layers used for the decoding. For a code with layers, the optimal sequence of layers minimizing the time spent in idle is given by

where is the number of idle cycles between layer and for the generic permutation and is given by (14), and is the set of the possible permutations of layers.

The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix , whose layers are scrambled according to the optimal permutation . Then, within each layer of , the order to update the nonnull subblocks is given by the strategy in use among EOP, ROP, and UOP.
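A brute-force search of this kind can be sketched with the Python standard library. The sketch assumes that the idle cycles between two consecutive layers are available through a user-supplied function `idle(a, b)` and that the sequence is cyclic, that is, the last layer of one iteration is followed by the first layer of the next; both are assumptions of this illustration, as is the toy cost used in the example.

```python
from itertools import permutations


def best_layer_sequence(num_layers, idle):
    """Exhaustive search of the layer sequence with the fewest idle cycles.

    idle(a, b) must return the idle cycles needed when layer b follows layer a.
    Layer 0 is kept first because rotations of a cyclic sequence are equivalent.
    """
    best_seq, best_cost = None, None
    for tail in permutations(range(1, num_layers)):
        seq = (0,) + tail
        cost = sum(idle(seq[k], seq[(k + 1) % num_layers])
                   for k in range(num_layers))
        if best_cost is None or cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


# Toy cost: number of soft-outputs shared by the two layers.
toy_layers = [{0, 1}, {1, 2}, {4, 5}, {0, 5}]
sequence, total_idle = best_layer_sequence(
    len(toy_layers), lambda a, b: len(toy_layers[a] & toy_layers[b]))
```

The search visits (n - 1)! sequences for n layers, which is 39,916,800 for 12 layers; this is still tractable offline, since the optimization is run once per code.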

### 4.7. Summary and Results

The three methods proposed in this section are not equally effective in minimizing the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the considered LDPC code, and ROP and EOP can be very close to UOP. As a case example, results are shown in Section 7 for the WLAN LDPC codes.

However, the effectiveness of the individual methods must be weighed against the requirements of the underlying decoder architecture and the cost of its hardware implementation, which is the subject of Section 5. In particular, UOP generally requires higher hardware complexity, so EOP or ROP can be preferred for particular codes.

## 5. Decoder Architectures

Low complexity and high throughput are key features demanded of every competitive LDPC decoder, and to this end, semi-parallel architectures are widely recognised as the best design choice.

As shown in [6, 8, 12] to mention just a few, a semi-parallel architecture includes an array of processing elements with size usually equal to the expansion factor of the base-matrix . Therefore, the HLD algorithm described in Section 3 must be intended in a vectorized form as well, and in order to exploit the code structure, a layer counts consecutive parity-check nodes. Layers (in the number of ) are updated in sequence by the check-node units (CNUs), and an array of SOs () and of *c2v* messages () are concurrently updated at every clock cycle. Since the parity-check equations in a layer are independent by construction, that is, they do not share SOs, the analysis of Section 4 still holds in a vectorized form.

The CNUs are designed to serially update the *c2v* magnitudes according to (3) and (4), and any arbitrary order of the *c2v* messages (and so of SOs, see line 13 of Algorithm 1) can be easily achieved by properly multiplexing between the two values as also shown in [23]. It must be pointed out that the 2-output approximation described in Section 3 is pivotal to a low-complexity implementation of EOP, ROP, or UOP in the CNU. However, the same strategies could also be used with a different (or even no) approximation in the CNU, although the cost of the related implementation would probably be higher.

Three VLSI architectures of a layered decoder will be described, that differ in the management of the memory units of both SO and *c2v*, and so result in different implementation costs in terms of memory (RAM and ROM) and logic.

### 5.1. Local Variable-to-Check Buffer

The most straightforward architecture of a vectorized layered decoder is shown in Figure 7. Here, the arrays of *v2c* messages entering the CNUs during the update of layer , are computed on-the-fly as with , and both the arrays of *c2v* and SO messages are retrieved from memory.

Then, the updated *c2v* messages are used to refine every array of SOs belonging to layer : according to line 13 of Algorithm 1, this is done by adding the new *c2v* array to the input *v2c* array . Since the CNUs work in pipeline, while the update of layer is still in progress, the array of the *v2c* messages belonging to layer is already being computed as , with . For this reason, needs to be temporarily stored in a local buffer as shown in Figure 7. The buffer is vectorized as well and stores messages, with the maximum CN degree in the code.

Before being stored back in memory, the array is circularly shifted and made ready for its next use, by applying *compound* or *incremental* rotations [12]; this operation is carried out by the *circular* shifting network of Figure 7, and more details about its architecture are available in [24].

The *v2c* buffer is the key element that allows the architecture to work in pipeline. It has to sustain one read and one write access concurrently and can be efficiently implemented with shift-register based architectures for EOP (first-in, first-out, FIFO buffer) and ROP (last-in, first-out, LIFO buffer). On the contrary, UOP needs to map the buffer onto a *dual*-port memory bank, whose (reading) address is provided by an extra configuration memory (ROM).

### 5.2. Double Memory Access

The buffer of the architecture of Section 5.1 (Arch. V-A) can be removed if the *v2c* messages are computed twice on-the-fly, as shown in Figure 8: the first time to feed the array of CNUs, and then to update the SOs. To this aim, a further reading is required to get the arrays and from memory, and so recompute the array on the CNUs output.

It follows that *three*-port memories are needed for both SO and *c2v* messages since three concurrent accesses have to be supported: two readings (see ports and in Figure 8) and one writing. This memory can be implemented by distributing data on several banks of customary *dual*-port memory, in such a way that two readings always involve different banks. Actually, in a layered decoder the same memory location needs to be accessed several times per iteration and concurrently with several other data, so that resorting to only two memory banks would be unfeasible. On the other hand, the management of a higher number of banks would add a significant overhead to the complexity of the whole design.

The proposed solution is sketched in Figure 9 and is based on only two banks (A and B) but, to clear access conflicts, some data are redundantly stored in both the banks (see elements C1 and C2 in the example of Figure 9).

The most trivial and expensive solution is achieved when both banks are a full copy or a *mirror* of the original memory as in [11], which corresponds to redundancy. Alternatively, data can be selectively assigned to the two banks through a computer search aiming at minimum redundancy.

Roughly speaking, if we denote by the cardinality of the set of data (SO or *c2v* messages) read concurrently to the th data for , then the higher is (for a given ), the higher is the expected redundancy. So, a small redundancy is experienced by the *c2v* memory, since each *c2v* message can collide with at most two other data (i.e., ), while a higher redundancy is associated to the SO memory, since every SO can face up to conflicts, with being the degree of the th variable node, typically greater than (especially for low-rate codes).

Indeed, the issue of memory partitioning and the reordering techniques described in Section 4 are linked to each other: whenever the CNUs are idle, only one reading is performed. Therefore, an overall system optimization aiming at minimizing the iteration latency and the amount of memory redundancy at the same time could be pursued; however, due to the huge optimization space, this task is almost unfeasible and is not considered in this work.
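The idea of splitting a memory onto two banks with selective duplication can be sketched as a graph problem, where two data items are connected whenever they are read in the same clock cycle. The sketch below is a greedy stand-in for the computer search mentioned above and does not guarantee minimum redundancy; the function name `partition_two_banks` and the toy conflict list are hypothetical.

```python
from collections import deque


def partition_two_banks(num_items, conflicts):
    """Assign data items to two banks so that conflicting reads never collide.

    conflicts: list of (i, j) pairs of items read in the same clock cycle.
    Returns (bank, duplicated): bank[i] is 0 or 1, and duplicated is the set
    of items that must be stored in both banks to clear residual conflicts.
    """
    adj = [[] for _ in range(num_items)]
    for i, j in conflicts:
        adj[i].append(j)
        adj[j].append(i)
    bank = [None] * num_items
    for start in range(num_items):
        if bank[start] is not None:
            continue
        bank[start] = 0
        queue = deque([start])
        while queue:                      # breadth-first 2-colouring attempt
            u = queue.popleft()
            for v in adj[u]:
                if bank[v] is None:
                    bank[v] = 1 - bank[u]
                    queue.append(v)
    duplicated = set()
    for i, j in conflicts:
        # a residual conflict is cleared by mirroring one of the two items
        if bank[i] == bank[j] and i not in duplicated and j not in duplicated:
            duplicated.add(j)
    return bank, duplicated


toy_conflicts = [(0, 1), (1, 2), (0, 2), (2, 3)]
banks, mirrored = partition_two_banks(4, toy_conflicts)
redundancy = len(mirrored) / 4
```

In the toy example the three mutually conflicting items cannot be separated with two banks only, so one of them is stored twice, for a redundancy of 25% instead of the 100% of a full mirror.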

### 5.3. Storage of Variable-to-Check Messages

During the elaboration of a generic layer, a certain *v2c* message is needed twice; to this end, a local buffer was used in Arch. V-A (Section 5.1) and multiple memory reading operations in Arch. V-B (Section 5.2).

A third way of solving the problem is computing the array of *v2c* messages only once per iteration, like in Arch. V-A, but instead of using a local buffer, the *v2c* messages are precomputed and stored in the SO memory ready for the next use, as sketched in Figure 10. A similar architecture is used in [10, 16] but the issue of decoding pipeline is not clearly stated there.

In this way, the SO memory turns into a *v2c* memory with the following meaning: the array updated by layer is stored in memory after marginalization with the *c2v* message , with being the index of the next layer reusing the same array of SOs, . In other words, the array of *v2c* messages involved in the next update of the same block-column is precomputed. Therefore, the data stored in the *v2c* memory are used twice, first to feed the array of CNUs, and then for the SOs update.

Similarly to Arch. V-B, a *three*-port memory would be required because of the decoding pipeline; the same considerations of Section 5.2 still hold, and an optimum partitioning of the *v2c* memory onto two banks with some redundancy can be found. Note that, as opposed to Arch. V-B, a customary dual-port memory is enough for *c2v* messages.

As far as the complexity is concerned, at first glance this solution seems to be preferable to Arch. V-B since it needs only two stages of parallel adders while the *c2v* memory is not split. However, the management of the reading ports of the *v2c* memory introduces significant overheads, since after the update of the soft outputs by layer , the memory controller must be aware of what is the next layer using the same soft outputs . This information needs to be stored in a dedicated configuration memory, whose size and area can be significant, especially in a multilength, multirate decoder.

## 6. A Case Study: The IEEE 802.11n LDPC Codes

### 6.1. LDPC Code Construction

The WLAN standard [3] defines AA-LDPC codes based on *circulants* of the identity matrix. Three different codeword lengths are supported, 648, 1296, and 1944 bits, each coming with four code rates, 1/2, 2/3, 3/4, and 5/6, for a total of 12 different codes. As a distinguishing feature, a different block-size is used for each codeword length, that is, 27, 54, and 81, respectively; accordingly, every code counts 24 *block*-columns, while the *block*-rows (layers) are in the number of 12, 8, 6, and 4 for code rates 1/2, 2/3, 3/4, and 5/6, respectively.

An example of the base-matrix for the code with length and rate is shown in Figure 11.

### 6.2. Multiframe Decoder Architecture

In order to attain an adequate throughput for every WLAN code, the decoder must include a number of CNUs at least equal to the largest block-size, 81. This means that two thirds of the processors would remain unused with the shortest codes.

In the latter case, the throughput can be increased thanks to a *multiframe* approach, where frames of the code with block-size are decoded in parallel. A similar solution is described in [12], but in that case two different frames are decoded in time-division multiplexing by exploiting the 2 nonoverlapped phases of the flooding algorithm. Here, frames are decoded concurrently, and more specifically, three different frames of the shortest code can be assigned to a cluster of 27 CNUs each.

Note that to work properly, the circular shifting network must support concurrent subrotations as described in [24].
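The assignment of frames to CNU clusters can be sketched as follows. This is a minimal illustration, not the decoder's actual control logic: the function names are hypothetical, the CNU count of 81 is assumed to match the largest block-size, and the number of parallel frames is assumed to be the integer number of block-sized clusters that fit in the CNU array.

```python
NUM_CNUS = 81  # assumed equal to the largest WLAN block-size


def frames_in_parallel(block_size, num_cnus=NUM_CNUS):
    """Frames that fit when each frame occupies `block_size` CNUs."""
    if not 0 < block_size <= num_cnus:
        raise ValueError("block size must be between 1 and the CNU count")
    return num_cnus // block_size


def cnu_clusters(block_size, num_cnus=NUM_CNUS):
    """Index ranges of the CNUs assigned to each concurrently decoded frame."""
    frames = frames_in_parallel(block_size, num_cnus)
    return [range(f * block_size, (f + 1) * block_size) for f in range(frames)]
```

Under these assumptions the shortest code (block-size 27) yields three clusters of 27 CNUs, while the block-sizes 54 and 81 each leave room for a single frame.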

## 7. Decoder Performance

To give a practical example of the reordering strategies described in Section 4, Figure 12 shows the data flow related to the update of layer 0 for the WLAN code of Figure 11. While 6 idle cycles are required when following the original, natural order of updates (see Figure 12(a)), EOP needs 5 cycles (see Figure 12(b)), ROP reduces them to 1 (see Figure 12(c)), and no idle cycle is used by UOP (see Figure 12(d)). The subsets defined in Section 4.1 are also shown in Figure 5, along with the optimal sequence of layers followed for decoding.

### 7.1. Latency and Throughput

The latency of a pipelined LDPC decoder is determined by the clock period, the number of iterations, the number of nonnull blocks in the code, the number of idle cycles per iteration, the cycles needed to empty the decoder pipeline, and the cycles spent on the input/output interface. Among these parameters, the number of iterations is set for good error-correction performance, the number of nonnull blocks is code-dependent, and the I/O cycles are fixed by the I/O management; thus, for a minimum latency, the designer can only act on the number of idle cycles, whose value can be optimised with the techniques of Section 4.
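The dependence of the latency on these quantities can be sketched as follows. The symbol names and the grouping of terms are assumptions made for illustration: $T_{clk}$ is the clock period, $N_{it}$ the number of iterations, $N_{blk}$ the number of nonnull blocks, $N_{idle}$ the idle cycles per iteration, $N_{pipe}$ the cycles to empty the pipeline, and $N_{IO}$ the I/O cycles, with the last two assumed to be counted once per frame.

```latex
t_d = T_{clk}\,\bigl[\,N_{it}\,(N_{blk} + N_{idle}) + N_{pipe} + N_{IO}\,\bigr]
```

In this form, $N_{idle}$ is multiplied by the number of iterations, which is why reducing the idle cycles per iteration has a direct effect on the overall latency.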

Focusing on the IEEE 802.11n codes, Table 1 shows the overall number of cycles for 12 iterations, the number of idle cycles per iteration, the percentage of idle cycles with respect to the total (idling %), and the throughput at the clock frequency of 240 MHz.

The latter is expressed in information bits decoded per time unit and is also referred to as net throughput; it scales with the number of frames decoded in parallel. For this reason, the figures of Table 1 for the short codes (three frames in parallel) are very similar to those for the long codes; on the contrary, the middle codes do not benefit from the same mechanism (i.e., a single frame is decoded at a time) and their throughput is scaled down by a factor 2/3.
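As a rough illustration of how the net throughput depends on the number of frames decoded in parallel, the sketch below assumes a latency model in which each iteration costs one cycle per nonnull block plus the idle cycles, with pipeline and I/O cycles added once per frame. The function names and the cycle budget are hypothetical; only the 240 MHz clock and the code sizes come from the text.

```python
def decoding_cycles(n_it, n_blk, n_idle, n_pipe=0, n_io=0):
    """Clock cycles per decoded frame under the assumed latency model."""
    return n_it * (n_blk + n_idle) + n_pipe + n_io


def net_throughput(frames, info_bits, cycles, f_clk_hz):
    """Information bits decoded per second for `frames` parallel frames."""
    return frames * info_bits * f_clk_hz / cycles


# Hypothetical cycle budget, deliberately taken equal for the three lengths
# so that only the effect of the frame count is visible.
cycles = decoding_cycles(n_it=12, n_blk=90, n_idle=10, n_pipe=5)
f_clk = 240e6

# Rate-1/2 codes: the information bits are half the codeword length.
short = net_throughput(frames=3, info_bits=324, cycles=cycles, f_clk_hz=f_clk)
middle = net_throughput(frames=1, info_bits=648, cycles=cycles, f_clk_hz=f_clk)
long_ = net_throughput(frames=1, info_bits=972, cycles=cycles, f_clk_hz=f_clk)
```

Under these assumptions the short and long codes reach the same figure, while the middle code reaches two thirds of it.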

The results of Table 1 are for every technique of Section 4 as well as for the original codes before optimization. Although EOP clearly outperforms the original codes, better results are achieved with ROP and UOP for the WLAN case example, where at most 14% and 11% of the decoding time is spent idle, respectively. On average, the decoding time decreases from 7.6 to 6.7 μs with EOP and even to 5.3 μs with ROP and 5.1 μs with UOP. This behaviour can be explained by considering that, for the WLAN codes, the term found in (6) for EOP is significantly nonnull, while, comparing (8) to (13), ROP and UOP basically differ by a single term, which is negligible for the WLAN codes.

### 7.2. Error-Correction Performance

Figure 13 compares the floating point frame error rate (FER) after 12 decoding iterations of a pipelined decoder using EOP, ROP, and UOP with a reference curve obtained by simulating the original parity-check matrix before optimization, in a nonpipelined decoder. Two simulations were run for each strategy, one with the proper number of idle cycles (curves with full markers), and the other without idle cycles and referred to as *full pipeline* mode (curves with empty markers).

As expected, the three strategies reach the reference curve of the HLD algorithm when properly idled. In full pipeline mode (no idle cycles), however, the performance of EOP is spoiled, while ROP and UOP only lose about 0.6 and 0.3 dB, respectively. This means that the reordering has significantly reduced the dependence between layers, so that only a few hazards arise without idle cycles.

Similarly to EOP, no received codeword is successfully decoded even at high SNRs if the original code descriptors are simulated in full pipeline. This confirms once more the importance of idle cycles in a pipelined HLD decoder and motivates the need for an optimization technique.

Considering the same scenario of Figure 13, Figure 14 shows the convergence speed, measured in average number of iterations, of the layered decoding algorithm. The curves confirm that HLD needs one half of the number of iterations of the flooding schedule, on average, and show that the full pipeline mode is also penalized in terms of speed.

## 8. Implementation Results

The complexity of an LDPC decoder for IEEE 802.11n codes was derived through logic synthesis on a low-power 65 nm CMOS technology targeting a clock frequency of 240 MHz. Every architecture of Section 5 was considered for implementation, each one supporting the three reordering strategies, for a total of 9 combinations. For good error correction performance, input LLRs and *c2v* messages were represented on 5 bits, while internal SO and *v2c* messages were represented on 7 bits.

Table 2 summarizes the complexity of the different designs in terms of logic, measured in equivalent Kgates, and number of RAM and ROM bits. Equivalent gates are counted by referring to the low-drive, 2-input NAND cell, whose area is 2.08 μm² for the target technology library. Arch. V-A needs the highest number of memory bits due to the local variable-to-check buffer, but its logic is smaller since it requires no additional hardware resources (adders) and fewer configuration bits.

Because of the partitioning of both the SO and the *c2v* memories, Arch. V-B needs more logic resources and more memory bits than Arch. V-C (both for data and configuration). The redundancy ratios of the SO and *c2v* memories in Arch. V-B, and of the *v2c* memory in Arch. V-C, are also reported in Table 2.

As a matter of fact, the three architectures are very similar in complexity and performance; for a given set of LDPC codes, the designer can select the most suitable solution by trading off decoding latency and throughput at the system level against the requirements of logic and memory in terms of area, speed, and power consumption at the technology level.

Table 3 compares the design of a decoder for IEEE 802.11n based on Arch. V-C with UOP against similar state-of-the-art implementations: a parallel decoder by Blanksby and Howland [7], a 2048-bit rate 1/2 TDMP decoder by Mansour and Shanbhag [25], a similar design for WLAN by Gunnam et al. [10], and a decoder for WiMAX by Brack et al. [26]. Here, for a fair comparison, the throughput is expressed in channel bits decoded per time unit; that is, it is the *channel* throughput.

For the comparison, we focused on the architectural efficiency, defined as the average number of clock cycles needed to update one block of the parity-check matrix. In decoders based on serial functional units, this figure is at least 1, and the higher it is, the less efficient the architecture. Actually, it can reach 1 only when the dependence between consecutive layers is solved at the code design level. This is the case of two WiMAX codes (specifically, class 1/2 and class 2/3B codes), which are hazard-free (or layered) "by construction", thus explaining the very low value achieved by [26]. However, [26] is as efficient as our design on the remaining nonlayered WiMAX codes, but the authors do not perform layered decoding on such codes.
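One way to write this figure of merit, under assumed symbol names, is the ratio between the clock cycles spent on a frame, $N_{cyc}$, and the number of block updates performed, that is, the number of iterations $N_{it}$ times the number of nonnull blocks $N_{blk}$:

```latex
\eta = \frac{N_{cyc}}{N_{it}\,N_{blk}}
```

With one block processed per clock cycle and no idle cycles, this ratio approaches 1; every idle cycle inserted between updates raises it.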

For decoders with parallel processing units (see [7, 25]) the architectural efficiency becomes a measure of the parallelization used in the processing units and it can be expressed as with being the average check node degree. Indeed, in a two-phase decoder, the number of blocks can be equivalently defined as the overall number of exchanged messages, divided by the number of functional units. If E is the number of edges in the code, then , which is an index of the parallelization used in the processors.

The different designs were also compared in terms of energy efficiency, defined as the energy spent per coded bit and per decoding iteration. This is computed from the decoding energy, which in turn follows from the power consumption. The latter was estimated with Synopsys Power Compiler, averaged over three different SNRs (corresponding to different convergence speeds), and includes the power dissipated in the memory units. In terms of energy, our design is more efficient than [25] and gets close to the parallel decoder in [7].
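The definition above can be sketched as a formula. The symbols are assumptions chosen for illustration: $E_{dec}$ is the decoding energy, $P$ the average power consumption, $t_d$ the decoding time, $N$ the number of coded bits decoded in that time, and $N_{it}$ the number of iterations. The second equality assumes that the decoding energy is simply the average power multiplied by the decoding time.

```latex
E_{bit} = \frac{E_{dec}}{N\,N_{it}} = \frac{P\,t_d}{N\,N_{it}}
```

Since $N / t_d$ is the channel throughput, the same quantity can be read as the power divided by the product of channel throughput and number of iterations.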

Since the design in [10] is for the same WLAN LDPC codes and implements a similar layered decoding algorithm with the same number of processing units, a closer inspection is warranted. Thanks to the idle optimization, our solution is more efficient in terms of throughput. Then, although our design saves about 70 mW in power consumption with respect to [10], the related energy efficiency has not been included in Table 3 since the reference scenario used to estimate the power consumption (238 mW) was not clearly defined. Finally, although curves for error correction performance are not available in [10], penalties are expected in view of the smaller accuracy used to represent messages (5 bits) and SOs (6 bits).

## 9. Conclusions

An effective method to counteract the pipeline hazards typical of block-serial layered decoders of LDPC codes has been presented in this paper. The method is based on rearranging the decoding operations in order to minimize the number of idle cycles inserted between updates, and has resulted in three different strategies named equal, reversed, and unconstrained output processing (EOP, ROP, and UOP).

Then, different semi-parallel VLSI architectures of a layered decoder for architecture-aware LDPC codes supporting the methods above have been described and applied to the design of a decoder for IEEE 802.11n LDPC codes.

The synthesis of the proposed decoder on a 65 nm low-power CMOS technology reached the clock frequency of 240 MHz, which corresponds to a net throughput ranging from 131 to 334 Mbps with UOP and 12 decoding iterations, outperforming similar designs.

This work has proved that the layered decoding algorithm can be extended, with neither modifications nor approximations, to every LDPC code, regardless of the interconnections in its parity-check matrix, provided that idle cycles are used to maintain the dependencies between the updates in the algorithm.

Also, the paradigm of code-decoder codesign has been reinforced in this work, since the described techniques have not only proved very effective in counteracting the pipeline hazards but also provide useful guidelines for the design of good, hazard-free LDPC codes. In this respect, the assumption that consecutive layers must not share soft outputs, as is the case for the WiMAX class 1/2 and 2/3B codes, is overcome, thus leaving more room for the optimization of the code performance at the level of the code design.

## References

1. **Satellite digital video broadcasting of second generation (DVB-S2).** ETSI Standard EN302307, February 2005.
2. IEEE Computer Society: **Air Interface for Fixed and Mobile Broadband Wireless Access Systems.** IEEE Std 802.16e™-2005, February 2006.
3. **IEEE P802.11n™/D1.06.** Draft amendment to Standard for high throughput, 802.11 Working Group, November 2006.
4. Gallager R: *Low-density parity-check codes, Ph.D. dissertation*. Massachusetts Institute of Technology; 1960.
5. MacKay D, Neal R: **Good codes based on very sparse matrices.** *Proceedings of the 5th IMA Conference on Cryptography and Coding, 1995*.
6. Mansour MM, Shanbhag NR: **High-throughput LDPC decoders.** *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 2003, **11**(6):976-996.
7. Blanksby A, Howland C: **A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder.** *IEEE Journal of Solid-State Circuits* 2002, **37**(3):404-412. 10.1109/4.987093
8. Zhong H, Zhang T: **Block-LDPC: a practical LDPC coding system design approach.** *IEEE Transactions on Circuits and Systems I* 2005, **52**(4):766-775.
9. Hocevar DE: **A reduced complexity decoder architecture via layered decoding of LDPC codes.** *Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS '04), 2004* 107-112.
10. Gunnam K, Choi G, Wang W, Yeary M: **Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard.** *Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), May 2007* 1645-1648.
11. Bhatt T, Sundaramurthy V, Stolpman V, McCain D: **Pipelined block-serial decoder architecture for structured LDPC codes.** *Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), April 2006* **4:** 225-228.
12. Fewer CP, Flanagan MF, Fagan AD: **A versatile variable rate LDPC codec architecture.** *IEEE Transactions on Circuits and Systems I* 2007, **54**(10):2240-2251.
13. Boutillon E, Tousch J, Guilloud F: **LDPC decoder, corresponding method, system and computer program.** US patent no. 7,174,495 B2, February 2007.
14. Rovini M, Rossi F, Ciao P, L'Insalata N, Fanucci L: **Layered decoding of non-layered LDPC codes.** *Proceedings of the 9th Euromicro Conference on Digital System Design (DSD '06), August-September 2006*.
15. Tanner R: **A recursive approach to low complexity codes.** *IEEE Transactions on Information Theory* 1981, **27**(5):533-547. 10.1109/TIT.1981.1056404
16. Zhang H, Zhu J, Shi H, Wang D: **Layered approx-regular LDPC: code construction and encoder/decoder design.** *IEEE Transactions on Circuits and Systems I* 2008, **55**(2):572-585.
17. Echard R, Chang S-C: **The π-rotation low-density parity check codes.** *Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '01), November 2001* 980-984.
18. Guilloud F, Boutillon E, Tousch J, Danger J-L: **Generic description and synthesis of LDPC decoders.** *IEEE Transactions on Communications* 2006, **55**(11):2084-2091.
19. Xiao H, Banihashemi AH: **Graph-based message-passing schedules for decoding LDPC codes.** *IEEE Transactions on Communications* 2004, **52**(12):2098-2105. 10.1109/TCOMM.2004.838730
20. Sharon E, Litsyn S, Goldberger J: **Efficient serial message-passing schedules for LDPC decoding.** *IEEE Transactions on Information Theory* 2007, **53**(11):4076-4091.
21. Zarkeshvari F, Banihashemi A: **On implementation of min-sum algorithm for decoding low-density parity-check (LDPC) codes.** *Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '02), November 2002* **2:** 1349-1353.
22. Jones C, Valles E, Smith M, Villasenor J: **Approximate-MIN constraint node updating for LDPC code decoding.** *Proceedings of the IEEE Military Communications Conference (MILCOM '03), October 2003* **1:** 157-162.
23. Rovini M, Rossi F, L'Insalata N, Fanucci L: **High-precision LDPC codes decoding at the lowest complexity.** *Proceedings of the 14th European Signal Processing Conference (EUSIPCO '06), September 2006*.
24. Rovini M, Gentile G, Fanucci L: **Multi-size circular shifting networks for decoders of structured LDPC codes.** *Electronics Letters* 2007, **43**(17):938-940. 10.1049/el:20071157
25. Mansour MM, Shanbhag NR: **A 640-Mb/s 2048-bit programmable LDPC decoder chip.** *IEEE Journal of Solid-State Circuits* 2006, **41**(3):684-698. 10.1109/JSSC.2005.864133
26. Brack T, Alles M, Kienle F, Wehn N: **A synthesizable IP core for WiMax 802.16E LDPC code decoding.** *Proceedings of the 17th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC '06), September 2006* 1-5.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Keywords

- Clock Cycle
- Wireless Local Area Network
- LDPC Code
- VLSI Architecture
- Tanner Graph