# Techniques and Architectures for Hazard-Free Semi-Parallel Decoding of LDPC Codes

Massimo Rovini (corresponding author), Giuseppe Gentile, Francesco Rossi, and Luca Fanucci

**2009**:723465

**DOI:** 10.1155/2009/723465

© Massimo Rovini et al. 2009

**Received:** 4 March 2009

**Accepted:** 27 July 2009

**Published:** 15 September 2009

## Abstract

The layered decoding algorithm has recently been proposed as an efficient means for the decoding of low-density parity-check (LDPC) codes, thanks to the remarkable improvement in the convergence speed (2x) of the decoding process. However, pipelined semi-parallel decoders suffer from violations or "hazards" between consecutive updates, which not only violate the layered principle but also enforce the loops in the code, thus spoiling the error correction performance. This paper describes three different techniques to properly reschedule the decoding updates, based on the careful insertion of "idle" cycles, to prevent the hazards of the pipeline mechanism. Also, different semi-parallel architectures of a layered LDPC decoder suitable for use with such techniques are analyzed. Then, taking the LDPC codes for the wireless local area network (IEEE 802.11n) as a case study, a detailed analysis of the performance attained with the proposed techniques and architectures is reported, and results of the logic synthesis on a 65 nm low-power CMOS technology are shown.

## 1. Introduction

Improving the reliability of data transmission over noisy channels is the key issue of modern communication systems and particularly of wireless systems, whose spatial coverage and data rate are increasing steadily.

In this context, low-density parity-check (LDPC) codes have gained momentum in the scientific community, and they have recently been adopted as forward error correction (FEC) codes by several communication standards, such as the second generation digital video broadcasting (DVB-S2, [1]), the wireless metropolitan area networks (WMANs, IEEE 802.16e, [2]), the wireless local area networks (WLANs, IEEE 802.11n, [3]), and the 10 Gbit Ethernet (10GBASE-T, IEEE 802.3an).

LDPC codes were first discovered by Gallager in the early 1960s [4] but were long put aside until MacKay and Neal, sustained by the advances in very-large-scale integration (VLSI) technology, rediscovered them in the 1990s [5]. The renewed interest and the success of LDPC codes are due to (i) the remarkable error-correction performance, even at low signal-to-noise ratios (SNRs) and for small block-lengths, (ii) the flexibility in the design of the code parameters, (iii) the decoding algorithm, very suitable for hardware parallelization, and last but not least (iv) the advent of structured or architecture-*aware* (AA) codes [6]. AA-LDPC codes reduce the decoder area and power consumption and improve the scalability of its architecture, and so allow the full exploitation of the complexity/throughput design trade-offs. Furthermore, AA-codes perform so close to random codes [6] that they are the common choice of all the latest LDPC-based standards.

Nowadays, data services and user applications impose severe low-complexity and low-power constraints on the design of practical decoders while demanding very high throughput. A fully parallel decoder architecture achieves impressive throughput but is so complex in terms of both area and routing [7] that a semi-parallel implementation is usually preferred (see [6, 8]).

So, to counteract the reduced throughput, designers can act at two levels: at the algorithmic level, by efficiently rescheduling the message-passing algorithm to improve its convergence rate, and at the architectural level, with the pipeline of the decoding process, to shorten the iteration time. The first matter can be solved with the *turbo-decoding message-passing* (TDMP) [6] or the *layered* decoding algorithm [9], while pipelined architectures are mandatory especially when the decoder employs serial processing units.

However, the pipeline mechanism may dramatically corrupt the error-correction performance of a layered decoder, because the processing units do not always work on the most updated messages. This issue, known as pipeline "hazard", arises when the dependence between the elaborations is violated. The idea is then to reschedule the sequence of updates and to delay the decoding process with "idle" cycles until newer data are available.

As an improvement to similar state-of-the-art works [10–13], this paper proposes three systematic techniques to optimally reschedule the decoding process in a way to minimize the number of idle cycles and achieve the maximum throughput. Also, this paper discusses different semi-parallel architectures, based on serial processing units and all supporting the reordering strategies, so as to attain the best trade-off between complexity and throughput for every LDPC code.

Semi-parallel architectures of LDPC decoders have recently been addressed in several papers, although none of them formally solves the issue of pipeline hazards and decoding idling. Gunnam et al. describe in [10] a pipelined semi-parallel decoder for WLAN LDPC codes, but the authors do not mention the issue of pipeline hazards; they only describe the need to properly scramble the sequence of data in order to clear some memory conflicts.

Boutillon et al. consider in [13] methods and architectures for layered decoding; the authors mention the problem of pipeline hazards (cut-edge conflict) and of using an output order different from the natural one in the processing units; nonetheless, the issue is not investigated further, and they simply modify the decoding algorithm to compute partial updates as in [14]. Although this approach allows the decoder to operate in full pipeline with no idle cycles, it is actually suboptimal in terms of both performance and complexity.

Similarly, Bhatt et al. propose in [11] a pipelined block-serial decoder architecture based on partial updates, but again, they do not investigate the dependence between elaborations.

In [12], Fewer et al. implement a semi-parallel TDMP decoder, but the authors boost the throughput by decoding two codewords in parallel and not by means of pipeline.

This paper is organised as follows. Section 2 recalls the basics of LDPC and of AA-LDPC codes and Section 3 summarizes the layered decoding algorithm. Section 4 introduces three different techniques to reduce the dependence between consecutive updates and analytically derives the related number of idle cycles. After this, Section 5 describes the VLSI architectures of a pipelined block-serial LDPC-layered decoder. Section 6 briefly reviews the WLAN codes used as a case study, while the performances of the related decoder are analysed in Section 7. Then, the results of the logic synthesis on a 65 nm low-power CMOS technology are discussed in Section 8, along with the comparison with similar state-of-the-art implementations. Finally, conclusions are drawn in Section 9.

## 2. Architecture-Aware Block-LDPC Codes

Recently, the joint design of code and decoder has blossomed in many works (see [8, 16]), and several principles have been established for the design of implementation-oriented AA-codes [6]. These can be summarized into (i) the arrangement of the parity-check matrix in squared subblocks, and (ii) the use of deterministic patterns within the subblocks. Accordingly, AA-LDPC codes are also referred to as *block*-LDPC codes [8].

The pattern used within blocks is the vital facet for a low-cost implementation of the interconnection network of the decoder and can be based either on permutations, as in [6] and for the class of $\pi$-rotation codes [17], or on *circulants* or cyclic shifts of the identity matrix, as in [8] and in all recent standards [1–3].

AA-LDPC codes are defined by the number of *block*-columns $n_c$, the number of *block*-rows $n_r$, and the block-size $B$, which is the size of the component submatrices. Their parity-check matrix $\mathbf{H}$ can be conveniently viewed as the expansion of a base-matrix $\mathbf{H}_{\text{base}}$ with size $n_r \times n_c$. The expansion is accomplished by replacing the 1's in $\mathbf{H}_{\text{base}}$ with permutations or circulants, and the 0's with null subblocks. Thus, the block-size $B$ is also referred to as *expansion*-factor, for a codeword length of the resulting LDPC code equal to $N = n_c \cdot B$ and code rate $r = 1 - n_r/n_c$.

A simple example of expansion or *vectorization* of a base-matrix is shown in Figure 1. The size, number, and location of the nonnull blocks in the code are the key parameters to get good error-correction performance and low-complexity of the related decoder.
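The expansion of a base-matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base-matrix is stored as a table of cyclic-shift values, the value `-1` marks a null subblock (a convention chosen here), and the toy matrix `base` is hypothetical and not taken from any standard.

```python
def circulant(size, shift):
    """Identity matrix of the given size, cyclically shifted right by `shift` columns."""
    return [[1 if col == (row + shift) % size else 0 for col in range(size)]
            for row in range(size)]


def expand(base_shifts, block_size):
    """Expand a base matrix of shift values into a binary parity-check matrix.

    An entry of -1 marks a null subblock (assumed convention); any entry
    s >= 0 is replaced by the identity matrix cyclically shifted by s.
    """
    rows = []
    for block_row in base_shifts:
        blocks = [circulant(block_size, s) if s >= 0
                  else [[0] * block_size for _ in range(block_size)]
                  for s in block_row]
        # Concatenate the r-th row of every subblock in this block-row.
        for r in range(block_size):
            rows.append([bit for block in blocks for bit in block[r]])
    return rows


# Toy base matrix (hypothetical): 2 block-rows, 3 block-columns, block-size 3.
base = [[0, 1, -1],
        [2, -1, 0]]
H = expand(base, block_size=3)
```

With block-size 3 the toy matrix expands to 6 rows and 9 columns, and every row has as many ones as there are nonnull blocks in its block-row.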

## 3. Decoding of LDPC Codes

LDPC codes are decoded with the *belief propagation* (BP) or *message-passing* (MP) algorithm, which belongs to the broader class of maximum *a posteriori* (MAP) algorithms. The BP algorithm has been proved to be optimal if the graph of the code does not contain cycles, but it can still be used and considered as a reference for practical codes with cycles. In the latter case, the sequence of the elaborations, also referred to as *schedule*, considerably affects the achievable performance.

The most common schedule for BP is the so-called two-phase or flooding schedule (FS) [18], where all parity-check nodes are updated first, followed by all variable nodes.

A different approach, taking the distribution of closed paths and girths in the code into account, has been described by Xiao and Banihashemi in [19]. Although *probabilistic* schedules are shown to outperform *deterministic* schedules, the random activation strategy of the processing nodes is not very suitable to HW implementation and adds significant complexity overheads.

The most attractive schedule is the *shuffled* or *layered* decoding [6, 9, 18, 20]. Compared to the FS, the layered schedule almost doubles the decoding convergence speed, both for codes with cycles and cycle-free [20]. This is achieved by looking at the code as a connection of smaller supercodes [6] or *layers* [9], exchanging intermediate reliability messages. Specifically, *a posteriori* messages are made available to the next layers immediately after computation and not at next iteration as in a conventional flooding schedule.

Layers can be any set of either check nodes (CNs) or variable nodes (VNs), and, accordingly, CN-*centric* (or horizontal) or VN-*centric* (or vertical) algorithms have been analyzed in [18, 20]. However, CN-*centric* solutions are preferable since they can exploit serial, flexible, and low-complexity CN processors.

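**Algorithm 1:** Horizontal layered decoding.

**input**: *a priori* LLRs of the received bits

**output**: *a posteriori* hard-decisions

The listing first initializes the messages and then updates the layers in sequence within a **while** loop, which runs until the maximum number of iterations is reached or convergence is detected.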

In Algorithm 1, $\lambda_n$ is the $n$th *a priori* LLR of the received bits, with $n = 0, 1, \ldots, N-1$ and $N$ the length of the codeword, $M$ is the overall number of parity-check constraints, and $N_{it}$ the number of decoding iterations. Also, $\mathcal{N}_m$ is the set of VNs connected to the $m$th CN, $\epsilon_{m,n}^{(q)}$ represents the check-to-variable (*c2v*) reliability message sent from CN $m$ to VN $n$ at iteration $q$, and $y_n$ is the total information or *soft-output* (SO) of the $n$th bit in the codeword (see Figure 1).

For the sake of an easier notation, it is assumed here that a layer corresponds to a single row of the parity-check matrix. Before being used by the next CN or layer, SOs are refined with the involved *c2v* message, as shown in line 13, and thanks to this mechanism, faster convergence is achieved.
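The layered schedule can be illustrated with a minimal Python sketch. It is not a transcription of Algorithm 1: the check-node magnitudes use the plain min-sum rule as a stand-in for the exact update, positive LLRs are assumed to favour bit 0, and the function and argument names are hypothetical. The point it shows is the soft-output refresh inside the layer loop, which makes updated values visible to the very next layer.

```python
def layered_min_sum(H, llr, max_iters=20):
    """Horizontal layered decoding with the min-sum check-node rule (a sketch).

    H   : list of parity checks, each a list of the variable indices it involves
          (every check is assumed to involve at least two variables)
    llr : a priori LLRs (positive values favour bit 0, an assumed convention)
    Returns (hard_decisions, iterations_used).
    """
    so = list(llr)                                # soft outputs, one per bit
    c2v = [[0.0] * len(row) for row in H]         # check-to-variable messages
    hard = [1 if x < 0 else 0 for x in so]
    for it in range(1, max_iters + 1):
        for m, row in enumerate(H):               # one layer = one check node
            # Variable-to-check messages: strip this check's old contribution.
            v2c = [so[n] - c2v[m][i] for i, n in enumerate(row)]
            for i, n in enumerate(row):
                others = v2c[:i] + v2c[i + 1:]
                sign = 1.0
                for x in others:
                    if x < 0:
                        sign = -sign
                new = sign * min(abs(x) for x in others)
                c2v[m][i] = new
                so[n] = v2c[i] + new              # refreshed SO, used by the next layer
        hard = [1 if x < 0 else 0 for x in so]
        if all(sum(hard[n] for n in row) % 2 == 0 for row in H):
            return hard, it                       # all parity checks satisfied
    return hard, max_iters


# Three checks over seven bits; bit 2 is received with the wrong sign.
checks = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
decoded, used = layered_min_sum(checks, [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0])
```

In this toy run the first layer already corrects the unreliable bit, and the following layers work on the corrected soft output within the same iteration.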

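The update of the CN can be simplified with a 2-output approximation of the magnitudes of the *c2v* messages, so that only two values need to be computed per CN; the corresponding expressions are referred to as (3) and (4) in the following.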

## 4. Decoding Pipelining and Idling

If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out and reading, so that data are continuously read from and written into memory, and a new layer is processed every $d_k$ clock cycles, with $d_k$ the degree of the layer (see Figure 3(b)).

Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory and used for the current elaboration may not always be up-to-date, as newer values may still be in the pipeline. This issue, known as pipeline hazard, prevents the use and so the propagation of always up-to-date messages and spoils the error-correction performance of the decoding algorithm.

The solution investigated in this paper is to insert *null* or *idle* cycles between consecutive updates, so that a node processor is suspended to wait for newer data. The number of idle cycles must be kept as small as possible since it affects the iteration time and so the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder as well as on the order followed to update messages within a layer.

Three different strategies are described in this section to reduce the dependence between consecutive updates in the horizontal layered decoding (HLD) algorithm and, accordingly, the number of idle cycles. These differ in the order followed for acquisition and writing-out of the decoding messages and constitute a powerful tool for the design of "layered", hazard-free, LDPC codes.

### 4.1. System Notation

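Without any loss of generality, let us identify a layer with one single parity-check node and, focusing on the set $S_k$ of soft-outputs participating to layer $k$, let us define the following subsets:

(i) $A_k = S_k \cap S_{k-1}$, the set of SOs in common with layer $k-1$;

(ii) $B_k = (S_k \cap S_{k-2}) \setminus A_k$, the set of SOs in common with layer $k-2$ and not in $A_k$;

(iii) $C_k = A_k \cap S_{k+1}$, the set of SOs in common with both layers $k-1$ and $k+1$;

(iv) $E_k = (S_k \cap S_{k+1}) \setminus (A_k \cup B_k)$, the set of SOs in common with layer $k+1$ and not in $A_k$ or $B_k$;

(v) $F_k = (S_k \cap S_{k+2}) \setminus (A_k \cup B_k \cup E_k)$, the set of SOs in common with layer $k+2$ but not in $A_k$, $B_k$, $E_k$;

(vi) $G_k = (S_k \cap S_{k-2} \cap S_{k+2}) \setminus (A_k \cup E_k)$, the set of SOs in common with both layers $k-2$ and $k+2$, but not in $A_k$ or $E_k$;

(vii) $R_k$, the set of remaining SOs.

In the definitions above the notation $X \setminus Y$ means the relative complement of $Y$ in $X$ or the set-theoretic difference of $X$ and $Y$. Let us also define the following cardinalities: $d_k = |S_k|$ (degree of layer $k$), $a_k = |A_k|$, $b_k = |B_k|$, $c_k = |C_k|$, $e_k = |E_k|$, $f_k = |F_k|$, $g_k = |G_k|$, $r_k = |R_k|$.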
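The first-order subsets of this notation can be computed directly with set operations. The sketch below is an illustration under assumptions: each layer is given as a Python set of soft-output (block-column) indices, the layer sequence is treated as cyclic because iterations repeat, and the function and key names are hypothetical. Only the sets tied to the previous and next layer are shown; the sets tied to layers two steps away follow the same pattern.

```python
def layer_subsets(layers, k):
    """Split the soft outputs of layer k according to its neighbouring layers.

    layers : list of sets; layers[j] holds the SO indices used by layer j
    The sequence is taken as cyclic, so layer -1 is the last layer.
    """
    n = len(layers)
    s = layers[k]
    prev1, prev2 = layers[(k - 1) % n], layers[(k - 2) % n]
    next1 = layers[(k + 1) % n]

    shared_prev = s & prev1                        # in common with layer k-1
    shared_prev2 = (s & prev2) - shared_prev       # in common with k-2, not with k-1
    shared_both = shared_prev & next1              # in common with both k-1 and k+1
    shared_next = (s & next1) - shared_prev - shared_prev2   # with k+1 only
    others = s - shared_prev - shared_prev2 - shared_next    # everything else
    return {
        "shared_prev": shared_prev,
        "shared_prev2": shared_prev2,
        "shared_both": shared_both,
        "shared_next": shared_next,
        "others": others,
    }


toy_layers = [{0, 1, 2, 3}, {1, 2, 4, 5}, {2, 5, 6, 7}, {0, 3, 6, 8}]
subsets = layer_subsets(toy_layers, 1)
```

For layer 1 of the toy example, SOs 1 and 2 come from the previous layer, SO 2 is also needed by the next one, and SO 5 is shared with the next layer only.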

### 4.2. Equal Output Processing

First, let us consider a very straightforward and implementation friendly architecture of the node processor that updates (and so delivers) the soft-output messages with the same order used to take them in.

In such a case it would be desirable to (i) postpone the acquisition of messages updated by the previous layer, that is, messages in $A_k$, and (ii) output the messages in $E_k$ as soon as possible to let the next layer start earlier. Actually, the last constraint only holds when $A_k$ does not include any message common to layer $k+1$, that is, when $C_k = \emptyset$; otherwise, the set $E_k$ could be acquired at any time before $A_k$.

Note that (5) and (6) only hold under the hypothesis of $C_k$ leading within $A_k$. If this is not the case, extra idle cycles could be added, the worst case being when $C_k$ is output last within $A_k$.

So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layer $k$ and $k+2$. Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers $k$ and $k+1$ and between layers $k+1$ and $k+2$.

### 4.3. Reversed Output Processing

Depending on the particular structure of the parity-check matrix $\mathbf{H}$, it may occur that most of the messages of layer $k$ in common with layer $k-1$ are also shared with layer $k+1$, that is, most of $A_k$ falls within $C_k$. If this condition holds, as for the WLAN LDPC codes (see Figure 11), it can be worth reversing the output order of SOs so that the messages in $A_k$ can be both acquired last and output first.

Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets , , and are managed similarly to , and . The ROP strategy is then instructed to acquire the set later and to output earlier. However, the situation is complicated by the fact that the set may not entirely coincide with ; rather it is , since some of the messages in can be found in . This is highlighted in Figure 5(b), where those messages of and not delivered to are shown in dark grey.

The margin is actually nonnull only if ; otherwise, under the hypothesis that (i) the set is output first within , and (ii) within , the messages not in are output last.

### 4.4. Unconstrained Output Processing

Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer $k$ can still delay the acquisition of the messages updated by layer $k-1$ (i.e., messages in $A_k$) as usual, but at the same time the messages common to layer $k+1$ (i.e., in $C_k$ and $E_k$) can also be delivered earlier.

Regarding the interaction between three consecutive layers, if the messages common to layer $k+2$ (i.e., in $F_k$) are output just after $E_k$, and if on layer $k+2$, the set $B_{k+2}$ is taken just before $A_{k+2}$, then there is no risk of pipeline hazard between layer $k$ and $k+2$.

### 4.5. Decoding of Irregular Codes

A serial processor cannot process consecutive layers with decreasing degrees, $d_{k+1} < d_k$, as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is nothing but another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with idle cycles.

### 4.6. Optimal Sequence of Layers

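The overall number of idle cycles per iteration depends on the sequence followed to update the layers, and the optimal sequence is the permutation of layers that minimizes it:

$$\pi_{\text{opt}} = \arg\min_{\pi \in \Pi} \sum_{k} I_k^{(\pi)}, \qquad (15)$$

where $I_k^{(\pi)}$ is the number of idle cycles between layer $k$ and $k+1$ for the generic permutation $\pi$ and is given by (14), and $\Pi$ is the set of the possible permutations of layers.

The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix $\tilde{\mathbf{H}}$, whose layers are scrambled according to the optimal permutation $\pi_{\text{opt}}$. Then, within each layer of $\tilde{\mathbf{H}}$, the order to update the nonnull subblocks is given by the strategy in use among EOP, ROP, and UOP.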
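A brute-force search of this kind is straightforward to sketch. The example below is only illustrative: the true per-layer idle count of the text is not reproduced, so a simple proxy cost (the number of soft outputs shared by cyclically consecutive layers) is used in its place, and the names `overlap_cost` and `best_layer_order` are hypothetical. Any other cost function with the same signature could be passed instead.

```python
from itertools import permutations


def overlap_cost(order, layers):
    """Proxy cost: soft outputs shared by cyclically consecutive layers.

    This stands in for the idle-cycle count of the text, which is not
    reproduced here.
    """
    n = len(order)
    return sum(len(layers[order[i]] & layers[order[(i + 1) % n]])
               for i in range(n))


def best_layer_order(layers, cost=overlap_cost):
    """Exhaustive search over layer permutations.

    Layer 0 is pinned first: the schedule is cyclic, so rotations of the
    same order have the same cost under a cyclic cost function.
    """
    rest = range(1, len(layers))
    candidates = ((0,) + p for p in permutations(rest))
    return min(candidates, key=lambda order: cost(order, layers))


toy_layers = [{0, 1}, {1, 2}, {3, 4}, {4, 5}]
best = best_layer_order(toy_layers)
```

The search space grows factorially with the number of layers, which stays tractable for base-matrices with a dozen block-rows or fewer but would call for heuristics on larger codes.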

### 4.7. Summary and Results

The three methods proposed in this section are differently effective to minimize the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the considered LDPC code, and ROP and EOP can be very close to UOP. As a case-example, results will be shown in Section 7 for the WLAN LDPC codes.

However, the effectiveness of the individual methods must be weighed against the requirements of the underlying decoder architecture and the costs of its hardware implementation, which is the objective of Section 5. In particular, UOP generally requires higher hardware complexity, so that EOP or ROP can be preferred for particular codes.

## 5. Decoder Architectures

Low complexity and high throughput are key features demanded of every competitive LDPC decoder, and to this extent, semi-parallel architectures are widely recognised as the best design choice.

As shown in [6, 8, 12] to mention just a few, a semi-parallel architecture includes an array of processing elements with size usually equal to the expansion factor $B$ of the base-matrix $\mathbf{H}_{\text{base}}$. Therefore, the HLD algorithm described in Section 3 must be intended in a vectorized form as well, and in order to exploit the code structure, a layer counts $B$ consecutive parity-check nodes. Layers (as many as the block-rows of the base-matrix) are updated in sequence by the $B$ check-node units (CNUs), and an array of $B$ SOs and of $B$ *c2v* messages are concurrently updated at every clock cycle. Since the parity-check equations in a layer are independent by construction, that is, they do not share SOs, the analysis of Section 4 still holds in a vectorized form.

The CNUs are designed to serially update the *c2v* magnitudes according to (3) and (4), and any arbitrary order of the *c2v* messages (and so of SOs, see line 13 of Algorithm 1) can be easily achieved by properly multiplexing between the two values as also shown in [23]. It must be pointed out that the 2-output approximation described in Section 3 is pivotal to a low-complexity implementation of EOP, ROP, or UOP in the CNU. However, the same strategies could also be used with a different (or even no) approximation in the CNU, although the cost of the related implementation would probably be higher.
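The multiplexing between two stored values can be sketched as follows. Expressions (3) and (4) are not reproduced here, so the example assumes the plain min-sum rule, where the two values are the smallest and second smallest input magnitudes; the function name `cnu_two_output` is hypothetical. Because each output only needs a 2-way selection and a sign, the messages can be delivered in any order.

```python
def cnu_two_output(v2c):
    """Min-sum style check-node update that keeps only two magnitudes.

    Returns a function mapping an edge index to its c2v message, so the
    caller is free to read the outputs in any order.
    """
    mags = [abs(x) for x in v2c]
    idx_min = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min]                                    # smallest magnitude
    min2 = min(m for i, m in enumerate(mags) if i != idx_min)  # second smallest
    total_sign = 1
    for x in v2c:
        if x < 0:
            total_sign = -total_sign

    def output(i):
        mag = min2 if i == idx_min else min1      # 2-way multiplexer
        own_sign = -1 if v2c[i] < 0 else 1        # remove the edge's own sign
        return total_sign * own_sign * mag

    return output


out = cnu_two_output([3.0, -1.0, 2.5, -4.0])
same_order = [out(i) for i in range(4)]               # input order
reversed_order = [out(i) for i in reversed(range(4))]  # reversed order
```

The same stored triple (two magnitudes and the position of the minimum) serves every output order, which is why the ordering strategy adds little cost inside the processing unit.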

Three VLSI architectures of a layered decoder will be described, that differ in the management of the memory units of both SO and *c2v*, and so result in different implementation costs in terms of memory (RAM and ROM) and logic.

### 5.1. Local Variable-to-Check Buffer

The *v2c* messages entering the CNUs during the update of layer $k$ are computed on-the-fly as the difference between the SO array and the *c2v* array of every nonnull block in the layer, and both the arrays of *c2v* and SO messages are retrieved from memory.

Then, the updated *c2v* messages are used to refine every array of SOs belonging to layer $k$: according to line 13 of Algorithm 1, this is done by adding the new *c2v* array to the input *v2c* array. Since the CNUs work in pipeline, while the update of layer $k$ is still in progress, the array of the *v2c* messages belonging to layer $k+1$ is already being computed in the same way. For this reason, the *v2c* array of layer $k$ needs to be temporarily stored in a local buffer as shown in Figure 7. The buffer is vectorized as well and stores $B \cdot d_{\max}$ messages, with $B$ the block-size and $d_{\max}$ the maximum CN degree in the code.

Before being stored back in memory, the array of updated SOs is circularly shifted and made ready for its next use, by applying *compound* or *incremental* rotations [12]; this operation is carried out by the *circular* shifting network of Figure 7, and more details about its architecture are available in [24].

The *v2c* buffer is the key element that allows the architecture to work in pipeline. It has to sustain one reading and one writing access concurrently and can be efficiently implemented with shift-register based architectures for EOP (first-in, first-out, FIFO buffer) and ROP (last-in, first-out, LIFO buffer). On the contrary, UOP needs to map the buffer onto a *dual*-port memory bank, whose (reading) address is provided by an extra configuration memory (ROM).
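The three buffer disciplines can be compared with a small behavioural sketch. It is a software model only, not a description of the hardware: the function name `replay` is hypothetical, and the `read_order` list plays the role of the configuration ROM that drives the reading address for UOP.

```python
from collections import deque


def replay(v2c_stream, strategy, read_order=None):
    """Return the order in which buffered v2c messages leave the buffer.

    strategy   : "EOP" (FIFO), "ROP" (LIFO) or "UOP" (addressed reads)
    read_order : list of indices, only used by UOP (stands in for the ROM)
    """
    buf = deque(v2c_stream)
    if strategy == "EOP":
        return [buf.popleft() for _ in range(len(buf))]   # first in, first out
    if strategy == "ROP":
        return [buf.pop() for _ in range(len(buf))]       # last in, first out
    if strategy == "UOP":
        stored = list(buf)
        return [stored[i] for i in read_order]            # arbitrary read address
    raise ValueError(f"unknown strategy: {strategy}")


stream = ["m0", "m1", "m2", "m3"]
eop = replay(stream, "EOP")
rop = replay(stream, "ROP")
uop = replay(stream, "UOP", read_order=[2, 0, 3, 1])
```

The FIFO and LIFO cases need no addressing information at all, which is consistent with the lower ROM requirements of EOP and ROP compared with UOP.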

### 5.2. Double Memory Access

In this architecture, *v2c* messages are computed twice on-the-fly, as shown in Figure 8: the first time to feed the array of CNUs, and then to update the SOs. To this aim, a further reading is required to get the SO and *c2v* arrays from memory, and so recompute the *v2c* array on the CNUs output.

It follows that *three*-port memories are needed for both SO and *c2v* messages since three concurrent accesses have to be supported: two readings (see the two reading ports in Figure 8) and one writing. This memory can be implemented by distributing data on several banks of customary *dual*-port memory, in such a way that two readings always involve different banks. Actually, in a layered decoder the same memory location needs to be accessed several times per iteration and concurrently with several other data, so that resorting to only two memory banks would be unfeasible. On the other hand, the management of a higher number of banks would add a significant overhead to the complexity of the whole design.

The most trivial and expensive solution is achieved when both banks are a full copy or a *mirror* of the original memory as in [11], which corresponds to 100% redundancy. As an alternative to this route, data can be selectively assigned to the two banks through a computer search aiming at a minimum redundancy.

Roughly speaking, the more data (SO or *c2v* messages) are read concurrently with a given datum, the higher is the expected redundancy. So, a small redundancy is experienced by the *c2v* memory, since each *c2v* message can collide with at most two other data, while a higher redundancy is associated to the SO memory, since the number of conflicts faced by every SO grows with the degree of the corresponding variable node and is typically larger (especially for low-rate codes).

Indeed, the issue of memory partitioning and the reordering techniques described in Section 4 are linked to each other: whenever the CNUs are in idle, only one reading is performed. Therefore, an overall system optimization aiming at minimizing the iteration latency and the amount of memory redundancy at the same time could be pursued; however, due to the huge optimization space, this task is almost unfeasible and is not considered in this work.

### 5.3. Storage of Variable-to-Check Messages

During the elaboration of a generic layer, a certain *v2c* message is needed twice, and a local buffer or multiple memory reading operations were implemented in Arch. V-A and Arch. V-B, respectively.

Arch. V-C computes *v2c* messages only once per iteration, like in Arch. V-A, but instead of using a local buffer, the *v2c* messages are precomputed and stored in the SO memory ready for the next use, as sketched in Figure 10. A similar architecture is used in [10, 16] but the issue of decoding pipeline is not clearly stated there.

In this way, the SO memory turns into a *v2c* memory with the following meaning: the array of SOs updated by layer $k$ is stored in memory after marginalization with the *c2v* message of layer $k'$, with $k'$ being the index of the next layer reusing the same array of SOs. In other words, the array of *v2c* messages involved in the next update of the same block-column is precomputed. Therefore, the data stored in the *v2c* memory are used twice, first to feed the array of CNUs, and then for the SOs update.

Similarly to Arch. V-B, a *three*-port memory would be required because of the decoding pipeline; the same considerations of Section 5.2 still hold, and an optimum partitioning of the *v2c* memory onto two banks with some redundancy can be found. Note that, as opposed to Arch. V-B, a customary dual-port memory is enough for *c2v* messages.

As far as the complexity is concerned, at first glance this solution seems to be preferable to Arch. V-B since it needs only two stages of parallel adders while the *c2v* memory is not split. However, the management of the reading ports of the *v2c* memory introduces significant overheads, since after the update of an array of soft outputs by layer $k$, the memory controller must be aware of which is the next layer $k'$ using the same soft outputs. This information needs to be stored in a dedicated configuration memory, whose size and area can be significant, especially in a multilength, multirate decoder.

## 6. A Case Study: The IEEE 802.11n LDPC Codes

### 6.1. LDPC Code Construction

The WLAN standard [3] defines AA-LDPC codes based on *circulants* of the identity matrix. Three different codeword lengths are supported, $N = 648$, $1296$, and $1944$, each coming with four code rates, $1/2$, $2/3$, $3/4$, and $5/6$, for a total of 12 different codes. As a distinguishing feature, a different block-size is used for each codeword length, that is, $B = 27$, $54$, and $81$, respectively; accordingly, every code counts 24 *block*-columns, while the *block*-rows (layers) are in the number of 12, 8, 6, and 4 for code rates $1/2$, $2/3$, $3/4$, and $5/6$, respectively.
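The parameters of the twelve codes follow from the block-size and the rate alone, as the short sketch below shows. The helper name `code_parameters` and the dictionary keys are hypothetical; the relations used are the ones for block-structured codes with 24 block-columns, block-sizes 27, 54 and 81, and rates 1/2, 2/3, 3/4 and 5/6.

```python
from fractions import Fraction

BLOCK_COLUMNS = 24
BLOCK_SIZES = (27, 54, 81)
RATES = (Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(5, 6))


def code_parameters(block_size, rate):
    """Derive length, information bits and number of layers of one code."""
    length = BLOCK_COLUMNS * block_size               # codeword length N
    block_rows = int(BLOCK_COLUMNS * (1 - rate))      # number of layers
    info_bits = int(length * rate)                    # information bits K
    return {"N": length, "K": info_bits, "layers": block_rows}


# One entry per (block-size, rate) pair.
codes = {(b, r): code_parameters(b, r) for b in BLOCK_SIZES for r in RATES}
```

Exact fractions are used instead of floating point so that the integer results (for instance 8 layers at rate 2/3) are not affected by rounding.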

An example of the base-matrix of one of these codes is shown in Figure 11.

### 6.2. Multiframe Decoder Architecture

In order to attain an adequate throughput for every WLAN code, the decoder must include a number of CNUs at least equal to the largest block-size, $B_{\max} = 81$. This means that two thirds of the processors would remain unused with the shortest codes.

In the latter case, the throughput can be increased thanks to a *multiframe* approach, where $F$ frames of the code with block-size $B$ are decoded in parallel. A similar solution is described in [12], but in that case two different frames are decoded in time-division multiplexing by exploiting the 2 nonoverlapped phases of the flooding algorithm. Here, $F = \lfloor B_{\max}/B \rfloor$ frames are decoded concurrently, and more specifically, three different frames of the shortest code can be assigned to a cluster of 27 CNUs each.

Note that to work properly, the circular shifting network must support concurrent subrotations as described in [24].

## 7. Decoder Performance

### 7.1. Latency and Throughput

The decoding latency is given by the clock period $T_{clk}$ times the overall number of clock cycles, which is determined by the number of iterations $N_{it}$, the number of nonnull blocks in the code $N_b$, the number of idle cycles per iteration $N_{idle}$, the cycles to empty the decoder pipeline $N_{pipe}$, and the cycles for the input/output interface $N_{io}$. Among the parameters above, $N_{it}$ is set for good error-correction performance, $N_b$ is a code-dependent parameter, and $N_{io}$ is fixed by the I/O management; thus, for a minimum latency, the designer can only act on $N_{idle}$, whose value can be optimised with the techniques of Section 4.

| Code length $N$ | | 648 | 648 | 648 | 648 | 1296 | 1296 | 1296 | 1296 | 1944 | 1944 | 1944 | 1944 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Code rate | | 1/2 | 2/3 | 3/4 | 5/6 | 1/2 | 2/3 | 3/4 | 5/6 | 1/2 | 2/3 | 3/4 | 5/6 |
| Original | latency (clock cycles) | 2299 | 1763 | 1779 | 1486 | 2106 | 1715 | 1886 | 1653 | 2107 | 1775 | 1752 | 1603 |
| | idle cycles per iteration | 91 | 46 | 47 | 22 | 81 | 46 | 60 | 43 | 77 | 47 | 48 | 41 |
| | idling % | 47% | 31% | 31% | 17% | 46% | 32% | 38% | 31% | 44% | 31% | 32% | 30% |
| | throughput (Mbps) | 101 | 176 | 197 | 262 | 74 | 121 | 124 | 157 | 111 | 175 | 200 | 243 |
| EOP | latency (clock cycles) | 1927 | 1691 | 1575 | 1462 | 1819 | 1643 | 1527 | 1377 | 1855 | 1691 | 1538 | 1352 |
| | idle cycles per iteration | 60 | 40 | 30 | 20 | 57 | 40 | 30 | 20 | 56 | 40 | 30 | 20 |
| | idling % | 37% | 28% | 23% | 16% | 37% | 29% | 23% | 17% | 36% | 28% | 23% | 17% |
| | throughput (Mbps) | 121 | 184 | 222 | 266 | 85 | 126 | 153 | 188 | 126 | 184 | 228 | 288 |
| ROP | latency (clock cycles) | 1308 | 1216 | 1290 | 1403 | 1223 | 1168 | 1239 | 1330 | 1283 | 1228 | 1243 | 1305 |
| | idle cycles per iteration | 8 | 0 | 6 | 15 | 7 | 0 | 6 | 16 | 8 | 1 | 5 | 16 |
| | idling % | 7.3% | 0% | 5.5% | 13% | 6.8% | 0% | 5.5% | 14% | 7.4% | 1% | 4.8% | 14% |
| | throughput (Mbps) | 178 | 256 | 271 | 277 | 127 | 178 | 188 | 195 | 182 | 253 | 282 | 298 |
| UOP | latency (clock cycles) | 1308 | 1216 | 1243 | 1380 | 1187 | 1168 | 1195 | 1260 | 1259 | 1216 | 1195 | 1164 |
| | idle cycles per iteration | 8 | 0 | 2 | 13 | 4 | 0 | 2 | 10 | 6 | 0 | 1 | 4 |
| | idling % | 7.3% | 0% | 1.9% | 11% | 4% | 0% | 2% | 9.3% | 5.6% | 0% | 0.9% | 4% |
| | throughput (Mbps) | 178 | 256 | 282 | 282 | 131 | 178 | 195 | 206 | 185 | 256 | 293 | 334 |

The decoding throughput is proportional to the number $F$ of frames decoded in parallel. For this reason, the figures of Table 1 for the short codes are very similar to those for the long codes ($F = 3$); on the contrary, the middle codes do not benefit from the same mechanism (i.e., $F = 1$) and their throughput is scaled down by a factor 2/3.

The results of Table 1 are given for every technique of Section 4 as well as for the original codes before optimization. Although EOP clearly outperforms the original codes, better results are achieved with ROP and UOP for the WLAN case example, where at most 14% and 11% of the decoding time are spent in idle, respectively. On average, the decoding time decreases from 7.6 to 6.7 μs with EOP and even to 5.3 μs with ROP and 5.1 μs with UOP. This behaviour can be explained by considering that for the WLAN codes one of the terms found in (6) for EOP is significantly nonnull, while comparing (8) to (13), ROP and UOP basically differ by a term that is negligible for the WLAN codes.
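The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch rests on assumptions inferred from the numbers rather than stated explicitly: latencies are read as clock cycles at 240 MHz (the clock frequency listed for this design in the comparison with other decoders), and the throughput is taken as the information bits of all frames decoded in parallel divided by the latency. The function name is hypothetical.

```python
F_CLK_MHZ = 240          # assumed clock frequency of the design
MAX_BLOCK_SIZE = 81      # number of CNUs, equal to the largest block-size


def throughput_mbps(length, rate, latency_cycles, block_size):
    """Information throughput with multiframe decoding (assumed relation)."""
    frames = MAX_BLOCK_SIZE // block_size          # 3, 1, 1 for B = 27, 54, 81
    latency_us = latency_cycles / F_CLK_MHZ        # cycles / MHz = microseconds
    return frames * length * rate / latency_us     # bits per microsecond = Mbps


# Latencies of the original (non-optimized) codes, in clock cycles.
original = [2299, 1763, 1779, 1486, 2106, 1715, 1886, 1653, 2107, 1775, 1752, 1603]
avg_us = sum(original) / len(original) / F_CLK_MHZ

short_half = throughput_mbps(648, 1 / 2, 2299, 27)     # original, N = 648, r = 1/2
middle_half = throughput_mbps(1296, 1 / 2, 2106, 54)   # original, N = 1296, r = 1/2
long_uop = throughput_mbps(1944, 5 / 6, 1164, 81)      # UOP, N = 1944, r = 5/6
```

Under these assumptions the average latency of the original codes comes out at about 7.6 μs, and the three sample throughputs round to the tabulated 101, 74 and 334 Mbps.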

### 7.2. Error-Correction Performance

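The error-correction performance of the three strategies was simulated both with the proper number of idle cycles and in *full pipeline* mode (curves with empty markers).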

As expected, the three strategies reach the reference curve of the HLD algorithm when properly idled. Then, in case of full pipeline (i.e., with no idle cycles), the performance of EOP is spoiled, while ROP and UOP only pay about 0.6 and 0.3 dB, respectively. This means that the reordering has significantly reduced the dependence between layers and only a few hazards arise without idle cycles.

Similarly to EOP, no received codeword is successfully decoded even at high SNRs if the original code descriptors are simulated in full pipeline. This confirms once more the importance of idle cycles in a pipelined HLD decoder and motivates the need for an optimization technique.

## 8. Implementation Results

The complexity of an LDPC decoder for IEEE 802.11n codes was derived through logic synthesis on a low-power 65 nm CMOS technology targeting 240 MHz. Every architecture of Section 5 was considered for implementation, each one supporting the three reordering strategies, for a total of 9 combinations. For good error correction performance, input LLRs and *c2v* messages were represented on 5 bits, while internal SO and *v2c* messages on 7 bits.

**Table 2:** IEEE 802.11n LDPC decoder complexity analysis.

| | | EOP | ROP | UOP |
|---|---|---|---|---|
| Arch. V-A | logic (Kgates) | 71.29 | 71.62 | 74.65 |
| | RAM bits | 61,722 | 61,722 | 61,722 |
| | ROM bits | 23,159 | 23,159 | 40,788 |
| Arch. V-B | logic (Kgates) | 75.45 | 75.75 | 77.99 |
| | RAM bits | 53,622 | 54,837 | 57,024 |
| | SO memory redundancy | 29.2% | 29.2% | 33.3% |
| | *c2v* memory redundancy | 1.1% | 4.6% | 9.1% |
| | ROM bits | 36,582 | 36,582 | 51,849 |
| Arch. V-C | logic (Kgates) | 71.83 | 72.14 | 74.60 |
| | RAM bits | 53,217 | 53,217 | 53,784 |
| | *v2c* memory redundancy | 29.2% | 29.2% | 33.3% |
| | ROM bits | 34,508 | 34,508 | 43,553 |

Because of the partitioning of both the SO and the *c2v* memories, Arch. V-B needs more logic resources and more memory bits than Arch. V-C (both for data and configuration). The redundancy ratios of the SO and *c2v* memories in Arch. V-B, and of the *v2c* memory in Arch. V-C, are also reported in Table 2.

As a matter of fact, the three architectures are very similar in complexity and performance. For a given set of LDPC codes, the designer can select the most suitable solution by trading off decoding latency and throughput at the system level against the requirements of logic and memory in terms of area, speed, and power consumption at the technology level.

*channel* throughput.

State-of-the-art LDPC decoder implementations.

| Parameter | [this] | [7] | [10] | [25] | [26] |
|---|---|---|---|---|---|
| Technology | 65 nm CMOS | | | | |
| Algorithm | layered | flooding | layered | TDMP | flooding/layered |
| CPU arch. | serial | parallel | serial | parallel | serial |
| Nb. of CPUs | 81 | 1536 | 81 | 64 | 96 |
| Msg. width (bits) | 5 + 7 | 4 + 4 | 5 + 6 | 4 + 5 | 6 |
| Clock freq. (MHz) | 240 | 64 | 500 | 125 | 333 |
| Rates | 1/2, 2/3, 3/4, 5/6 | | | | |
| Codeword length, N | 648, 1296, 1944 | 1024 | 648, 1296, 1944 | 2048 | |
| Codeword size, B | 27, 54, 81 | 1 | 27, 54, 81 | 64 | |
| Number of blocks | 79–88 | 4,33 | 79–88 | 96 | 76–88 |
| Speed (iterations) | 12 | 64 | 5 | 10 | 16 |
| Channel throughput (Mbps) | 262–401 | 1,024 | 541–1,618 | 640 | 177–999 |
| Area | 100.7 (0.207) | 1750 (52.5) | 99.9 (1.85) | 220 (14.3) | 489.9 (2.964) |
| RAM bits | 56,376 | — | 55,344 | 51,680 | NA |
| Power consumption (W) | 0.162 | 0.69 | 0.238 | 0.787 | NA |
| Architectural efficiency (clock cycles per block) | 1.103–1.306 | 0.231 | 1.361–1.521 | 0.417 | 1.01–1.31 |
| Decoding energy (pJ/bit/iteration) | 33.7–51.5 | 10.5 | — | 123 | — |

The architectural efficiency represents the average number of clock cycles needed to update one block. In decoders based on serial functional units, the higher this figure, the less efficient the architecture. It can reach 1 only when the dependence between consecutive layers is solved at the code design level. This is the case for two WiMAX codes (specifically, the class 1/2 and class 2/3B codes), which are hazard-free (or layered) "by construction", thus explaining the very low value achieved by [26]. However, [26] is as efficient as our design on the remaining nonlayered WiMAX codes, but the authors do not perform layered decoding on such codes.

For decoders with parallel processing units (see [7, 25]) the architectural efficiency becomes a measure of the parallelization used in the processing units and it can be expressed as with being the average check node degree. Indeed, in a two-phase decoder, the number of blocks can be equivalently defined as the overall number of exchanged messages, divided by the number of functional units. If E is the number of edges in the code, then , which is an index of the parallelization used in the processors.

with being the decoding energy and being the power consumption. The latter was estimated with Synopsys Power Compiler and was averaged out over three different SNRs (corresponding to different convergence speeds) and includes the power dissipated in the memory units (about of the total). In terms of energy, our design is more efficient than [25] and gets close to the parallel decoder in [7].
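As a rough cross-check of the energy figures in the comparison table, the sketch below recomputes a decoding energy per bit and per iteration from the tabulated power, channel throughput, and iteration count. The relation used here (power divided by the product of throughput and iterations) and the pJ/bit/iteration unit are assumptions inferred from the tabulated numbers rather than quoted from the text, and the function name is hypothetical.

```python
def energy_pj_per_bit_per_iter(power_w, throughput_mbps, n_iter):
    """Assumed metric: power / (channel throughput * iterations), in pJ."""
    bits_per_second = throughput_mbps * 1e6
    joules = power_w / (bits_per_second * n_iter)
    return joules * 1e12  # convert J to pJ

# Values taken from the comparison table (power in W, throughput in Mbps).
ours_best = energy_pj_per_bit_per_iter(0.162, 401, 12)   # about 33.7
ours_worst = energy_pj_per_bit_per_iter(0.162, 262, 12)  # about 51.5
ref7 = energy_pj_per_bit_per_iter(0.69, 1024, 64)        # about 10.5
ref25 = energy_pj_per_bit_per_iter(0.787, 640, 10)       # about 123

if __name__ == "__main__":
    for name, value in [("this, best", ours_best), ("this, worst", ours_worst),
                        ("[7]", ref7), ("[25]", ref25)]:
        print(f"{name}: {value:.1f} pJ/bit/iteration")
```

Under this assumed relation the recomputed values agree with the tabulated ones to the first decimal, which is consistent with the statement that the design is more energy efficient than [25] and close to [7].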

Since the design in [10] is for the same WLAN LDPC codes and implements a similar layered decoding algorithm with the same number of processing units, a closer comparison is warranted. Thanks to the idle optimization, our solution is more efficient in terms of throughput. Then, although our design saves about 70 mW in power consumption with respect to [10], the related energy efficiency has not been included in Table 2 since the reference scenario used to estimate the power consumption (238 mW) was not clearly defined. Finally, although curves for error correction performance are not available in [10], penalties are expected in view of the lower precision used to represent *c2v* (5 bits) and SO (6 bits) messages.
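The throughput comparison with [10] can be illustrated with a small sketch built on the tabulated figures. Two assumptions are made here and are not taken from the text: first, that channel throughput relates to the other table rows as N * f_clk / (iterations * blocks * efficiency), a relation that reproduces the single-valued columns [7] and [25] when the entry "4,33" is read as 4.33; second, that normalizing throughput by clock frequency and iteration count is a fair way to compare two designs running at different clocks. All function names are hypothetical.

```python
def channel_throughput_mbps(n_bits, f_clk_mhz, n_iter, n_blocks, efficiency):
    """Assumed relation: one codeword takes n_iter * n_blocks * efficiency cycles."""
    cycles_per_codeword = n_iter * n_blocks * efficiency
    return n_bits * f_clk_mhz / cycles_per_codeword

def bits_per_cycle_per_iteration(throughput_mbps, f_clk_mhz, n_iter):
    """Throughput normalized by clock frequency and iteration count."""
    return throughput_mbps * n_iter / f_clk_mhz

# Sanity check of the assumed relation on the single-valued columns.
t_ref7 = channel_throughput_mbps(1024, 64, 64, 4.33, 0.231)   # close to 1,024
t_ref25 = channel_throughput_mbps(2048, 125, 10, 96, 0.417)   # close to 640

# Peak channel throughput of this design versus [10], normalized.
ours_peak = bits_per_cycle_per_iteration(401, 240, 12)
ref10_peak = bits_per_cycle_per_iteration(1618, 500, 5)

if __name__ == "__main__":
    print(f"[7]: {t_ref7:.0f} Mbps, [25]: {t_ref25:.0f} Mbps")
    print(f"this: {ours_peak:.2f}, [10]: {ref10_peak:.2f} bits/cycle/iteration")
```

With these assumptions, the peak normalized throughput of this design comes out higher than that of [10], in line with the lower architectural efficiency figures listed for it. The peak values of the two designs are assumed to refer to comparable codes, which the table does not state explicitly.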

## 9. Conclusions

An effective method to counteract the pipeline hazards typical of block-serial layered decoders of LDPC codes has been presented in this paper. The method is based on rearranging the decoding operations in order to minimize the number of idle cycles inserted between updates, and it results in three different strategies named equal, reversed, and unconstrained output processing (EOP, ROP, and UOP).

Then, different semi-parallel VLSI architectures of a layered decoder for architecture-aware LDPC codes supporting the methods above have been described and applied to the design of a decoder for IEEE 802.11n LDPC codes.

The synthesis of the proposed decoder on a 65 nm low-power CMOS technology reached the clock frequency of 240 MHz, which corresponds to a net throughput ranging from 131 to 334 Mbps with UOP and 12 decoding iterations, outperforming similar designs.

This work has shown that the layered decoding algorithm can be extended, without modifications or approximations, to every LDPC code, regardless of the interconnections in its parity-check matrix, provided that idle cycles are used to maintain the dependencies between the updates in the algorithm.

The paradigm of code-decoder codesign has also been reinforced in this work: the described techniques have not only proved very effective in counteracting the pipeline hazards, but they also provide useful guidelines for the design of good, hazard-free LDPC codes. In this respect, the assumption that consecutive layers must not share soft outputs, as in the WiMAX class 1/2 and 2/3B codes, is overcome, thus leaving more room for the optimization of the code performance at the level of the code design.

## References

- Satellite digital video broadcasting of second generation (DVB-S2). ETSI Standard EN302307, February 2005.
- IEEE Computer Society: Air Interface for Fixed and Mobile Broadband Wireless Access Systems. IEEE Std 802.16e-2005, February 2006.
- IEEE P802.11n/D1.06: Draft amendment to Standard for high throughput, 802.11 Working Group, November 2006.
- Gallager R: *Low-density parity-check codes, Ph.D. dissertation*. Massachusetts Institute of Technology; 1960.
- MacKay D, Neal R: **Good codes based on very sparse matrices.** *Proceedings of the 5th IMA Conference on Cryptography and Coding, 1995*.
- Mansour MM, Shanbhag NR: **High-throughput LDPC decoders.** *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 2003, **11**(6):976-996.
- Blanksby A, Howland C: **A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder.** *IEEE Journal of Solid-State Circuits* 2002, **37**(3):404-412. doi:10.1109/4.987093
- Zhong H, Zhang T: **Block-LDPC: a practical LDPC coding system design approach.** *IEEE Transactions on Circuits and Systems I* 2005, **52**(4):766-775.
- Hocevar DE: **A reduced complexity decoder architecture via layered decoding of LDPC codes.** *Proceedings of the IEEE Workshop on Signal Processing Systems (SISP '04), 2004*, 107-112.
- Gunnam K, Choi G, Wang W, Yeary M: **Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard.** *Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), May 2007*, 1645-1648.
- Bhatt T, Sundaramurthy V, Stolpman V, McCain D: **Pipelined block-serial decoder architecture for structured LDPC codes.** *Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), April 2006*, **4:** 225-228.
- Fewer CP, Flanagan MF, Fagan AD: **A versatile variable rate LDPC codec architecture.** *IEEE Transactions on Circuits and Systems I* 2007, **54**(10):2240-2251.
- Boutillon E, Tousch J, Guilloud F: **LDPC decoder, corresponding method, system and computer program.** US patent no. 7,174,495 B2, February 2007.
- Rovini M, Rossi F, Ciao P, L'Insalata N, Fanucci L: **Layered decoding of non-layered LDPC codes.** *Proceedings of the 9th Euromicro Conference on Digital System Design (DSD '06), August-September 2006*.
- Tanner R: **A recursive approach to low complexity codes.** *IEEE Transactions on Information Theory* 1981, **27**(5):533-547. doi:10.1109/TIT.1981.1056404
- Zhang H, Zhu J, Shi H, Wang D: **Layered approx-regular LDPC: code construction and encoder/decoder design.** *IEEE Transactions on Circuits and Systems I* 2008, **55**(2):572-585.
- Echard R, Chang S-C: **The π-rotation low-density parity check codes.** *Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '01), November 2001*, 980-984.
- Guilloud F, Boutillon E, Tousch J, Danger J-L: **Generic description and synthesis of LDPC decoders.** *IEEE Transactions on Communications* 2006, **55**(11):2084-2091.
- Xiao H, Banihashemi AH: **Graph-based message-passing schedules for decoding LDPC codes.** *IEEE Transactions on Communications* 2004, **52**(12):2098-2105. doi:10.1109/TCOMM.2004.838730
- Sharon E, Litsyn S, Goldberger J: **Efficient serial message-passing schedules for LDPC decoding.** *IEEE Transactions on Information Theory* 2007, **53**(11):4076-4091.
- Zarkeshvari F, Banihashemi A: **On implementation of min-sum algorithm for decoding low-density parity-check (LDPC) codes.** *Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '02), November 2002*, **2:** 1349-1353.
- Jones C, Valles E, Smith M, Villasenor J: **Approximate-MIN constraint node updating for LDPC code decoding.** *Proceedings of the IEEE Military Communications Conference (MILCOM '03), October 2003*, **1:** 157-162.
- Rovini M, Rossi F, L'Insalata N, Fanucci L: **High-precision LDPC codes decoding at the lowest complexity.** *Proceedings of the 14th European Signal Processing Conference (EUSIPCO '06), September 2006*.
- Rovini M, Gentile G, Fanucci L: **Multi-size circular shifting networks for decoders of structured LDPC codes.** *Electronics Letters* 2007, **43**(17):938-940. doi:10.1049/el:20071157
- Mansour MM, Shanbhag NR: **A 640-Mb/s 2048-bit programmable LDPC decoder chip.** *IEEE Journal of Solid-State Circuits* 2006, **41**(3):684-698. doi:10.1109/JSSC.2005.864133
- Brack T, Alles M, Kienle F, Wehn N: **A synthesizable IP core for WiMax 802.16E LDPC code decoding.** *Proceedings of the 17th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC '06), September 2006*, 1-5.

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.