The data flow of a pipelined decoder with serial processing units is sketched in Figure 2. A centralized memory unit keeps the updated soft-outputs, computed by the node processors (NPs) according to Algorithm 1. If we denote by the number of nonnull blocks in layer , that is, the degree of layer , then the processor takes clock cycles to serially load its inputs. The refined values are then written back to memory (after scrambling or permutation) with a latency of clock cycles, and this operation again takes clock cycles. Overall, the processing time of layer is clock cycles, as shown in Figure 3(a).

If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out and reading, so that data are continuously read from and written into memory, and a new layer is processed every clock cycles (see Figure 3(b)).
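The saving granted by pipelining can be sketched with a toy timing model (an illustration only: the layer degrees and the latency value below are hypothetical, and the cost model simply follows the read/elaborate/write-back phases described above):

```python
def iteration_time(degrees, latency, pipelined):
    """Clock cycles for one decoding iteration under a sketch model:
    reading a layer of degree d takes d cycles, the refined values come
    back after a `latency`-cycle data path, and writing them takes d
    more cycles."""
    if not pipelined:
        # Each layer runs read -> elaborate -> write-back in isolation.
        return sum(d + latency + d for d in degrees)
    # Ideal pipeline (no hazards, no idle cycles): reads run
    # back-to-back, and only the last layer's latency and write-back
    # extend past the final read.
    return sum(degrees) + latency + degrees[-1]

# Hypothetical code with four layers and a 5-cycle data-path latency.
print(iteration_time([8, 8, 7, 8], 5, pipelined=False))  # 82 cycles
print(iteration_time([8, 8, 7, 8], 5, pipelined=True))   # 44 cycles
```

Under this model the pipelined decoder starts a new layer every "degree" cycles, which is what Figure 3(b) depicts in the original paper.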

Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory for the current elaboration may not always be up-to-date: newer values could still be in the pipeline. This issue, known as a pipeline hazard, prevents the propagation of fully updated messages and spoils the error-correction performance of the decoding algorithm.

The solution investigated in this paper is to insert *null* or *idle* cycles between consecutive updates, so that a node processor is suspended while waiting for newer data. The number of idle cycles must be kept as small as possible, since it affects the iteration time and hence the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder, as well as on the order followed to update messages within a layer.

Three different strategies are described in this section to reduce the dependence between consecutive updates in the HLD algorithm and, accordingly, the number of idle cycles. These differ in the order followed for the acquisition and writing-out of the decoding messages and constitute a powerful tool for the design of "layered", hazard-free, LDPC codes.

### 4.1. System Notation

Without loss of generality, let us identify a layer with one single parity-check node and, focusing on the set of soft-outputs participating in layer , define the following subsets:

(i), the set of SOs in common with layer ;

(ii), the set of SOs in common with layer and not in ;

(iii), the set of SOs in common with both layers and ;

(iv), the set of SOs in common with layer and not in or ;

(v), the set of SOs in common with layer but not in , , ;

(vi), the set of SOs in common with both layers and , but not in or ;

(vii), the set of remaining SOs.

In the definitions above the notation means the relative complement of in or the set-theoretic difference of and . Let us also define the following cardinalities: (degree of layer ), , , , , , , .
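These subsets can be computed mechanically from the block structure of the parity-check matrix. The sketch below is one plausible reading of the definitions above (the function and variable names are illustrative, not the paper's notation), assuming each layer is represented simply as the set of block-column indices of its nonnull blocks:

```python
def layer_subsets(layers, i):
    """Decompose the soft-outputs (SOs) of layer i by which
    neighbouring layers also use them. `layers` is a list of sets of
    block-column indices; indices wrap around because layered decoding
    cycles through the layers."""
    n = len(layers)
    cur, prev = layers[i], layers[(i - 1) % n]
    nxt, nxt2 = layers[(i + 1) % n], layers[(i + 2) % n]

    from_prev = cur & prev                          # refreshed by the previous layer
    to_next = (cur & nxt) - nxt2 - from_prev        # wanted by the next layer only
    to_both = (cur & nxt & nxt2) - from_prev        # wanted by the next two layers
    to_next2 = (cur & nxt2) - nxt - from_prev       # wanted by layer i+2 only
    rest = cur - prev - nxt - nxt2                  # not shared with the neighbours
    return from_prev, to_next, to_both, to_next2, rest

# Hypothetical code with four layers of degree 4 each.
layers = [{0, 1, 2, 3}, {2, 3, 4, 5}, {4, 5, 6, 0}, {6, 7, 1, 2}]
print(layer_subsets(layers, 0))
```

The cardinalities of these sets are what enter the idle-cycle expressions derived in the rest of the section.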

### 4.2. Equal Output Processing

First, let us consider a very straightforward and implementation-friendly architecture of the node processor that updates (and so delivers) the soft-output messages in the same order used to take them in.

In such a case it would be desirable to (i) postpone the acquisition of messages updated by the previous layer, that is, messages in , and (ii) output the messages in as soon as possible to let the next layer start earlier. Actually, the latter constraint only holds when does not include any message common to layer , that is, when ; otherwise, the set could be acquired at any time before .

Figure 4 shows the I/O data streams of an equal output processing (EOP) unit. Here, is the latency of the SO data-path, including the elaboration in the NP, the scrambling, and the two memory accesses (reading and writing). Focusing on layer , the set cannot be assigned to any specific position within , since the whole is acquired according to the same order used by layer to output (and so also acquire) the sets and . For this reason, the situation plotted in Figure 4 is drawn this way only for clarity.

With reference to Figure 4, pipeline hazards are cleared if idle cycles are spent between layer and so that

with for and otherwise. This means that if is empty, then the messages in do not need to be waited for. The solution to (5) with minimum latency is

Note that (5) and (6) only hold under the hypothesis of leading within . If this is not the case, up to extra idle cycles could be added if is output last within .
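Although the closed-form expressions (5)-(6) depend on the cardinalities defined above, the underlying constraint can also be checked mechanically: every shared soft-output must be written back by the current layer before the next layer reads it. A generic sketch follows (an assumed model, not the paper's equations: one message is read or written per cycle, the write stream and the read stream share a common time origin, and the data-path latency is a parameter):

```python
def min_idle(write_order, read_order, latency):
    """Smallest number of idle cycles to prepend to the next layer's
    read stream so that every shared message is read only after it has
    been written back.
    Assumed model: the j-th message in `write_order` becomes valid at
    cycle j + latency; the t-th message in `read_order` is consumed at
    cycle idle + t."""
    avail = {m: j + latency for j, m in enumerate(write_order)}
    idle = 0
    for t, m in enumerate(read_order):
        if m in avail:  # shared message: reading it too early is a hazard
            idle = max(idle, avail[m] - t)
    return idle

# Hypothetical message labels: 'c' is written last but read first,
# so it dominates the required idle time.
print(min_idle(['a', 'b', 'c'], ['c', 'x', 'a'], latency=2))  # 4
```

This also makes the remark after (5) concrete: if the two layers share no messages, the loop never fires and no idle cycles are needed.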

So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layer and . Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers and and between layers and .

### 4.3. Reversed Output Processing

Depending on the particular structure of the parity-check matrix , it may occur that most of the messages of layer in common with layer are also shared with layer , that is, and . If this condition holds, as it does for the WLAN LDPC codes (see Figure 11), it can be worthwhile to reverse the output order of the SOs so that the messages in can be both acquired last and output first.

Figure 5(a) shows the I/O streams of a reversed output processing (ROP) unit. Exploiting the reversal mechanism, the set is acquired second-last, just before , so that it is available earlier for layer .

Following reasoning similar to that for EOP, the situation sketched in Figure 5(a), where is delivered first within , is again only for ease of representation, and the condition for hazard-free layered decoding is now

Indeed, when , one could output first in , and so get rid of the term . However, since is actually left floating within , (7) represents again a best-case scenario, and up to extra idle cycles could be required. From (7), the minimum latency solution is

Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets , , and are managed similarly to , and . The ROP strategy is then instructed to acquire the set later and to output earlier. However, the situation is complicated by the fact that the set may not entirely coincide with ; rather it is , since some of the messages in can be found in . This is highlighted in Figure 5(b), where those messages of and not delivered to are shown in dark grey.

To clear the hazards between three layers, additional idle cycles are added in the number of

where is the acquisition margin on layer , and is the writing-out margin on layer . These can be computed under the assumption of no hazard between layer and (i.e., is aligned with thanks to as shown in Figure 5(b)) and are given by

The margin is actually nonnull only if ; otherwise it vanishes, under the hypothesis that (i) the set is output first within and (ii) within , the messages not in are output last.

Overall, the number of idle cycles of ROP is given by

### 4.4. Unconstrained Output Processing

Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer can still delay the acquisition of the messages updated by layer (i.e., messages in ) as usual, but at the same time the messages common to layer (i.e., in ) can also be delivered earlier.

The input and output data streams of an unconstrained output processing (UOP) unit are shown in Figure 6. Now, hazard-free layered decoding is achieved when

which yields

Regarding the interaction between three consecutive layers, if the messages common to layer (i.e., in ) are output just after , and if on layer , the set is taken just before , then there is no risk of pipeline hazard between layer and .
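With the input and output orders decoupled, a natural UOP schedule follows directly from the discussion above: acquire last the messages refreshed by the previous layer, and deliver first the messages wanted by the next layer. The sketch below illustrates this idea (an illustration of the principle, not the paper's exact ordering rule; layers are again plain sets of block-column indices):

```python
def uop_orders(cur, prev, nxt):
    """Build decoupled read/write orders for the layer `cur`.
    Reads: messages shared with the previous layer (`prev`) go last,
    since their refreshed values arrive latest.
    Writes: messages shared with the next layer (`nxt`) go first,
    since it needs them earliest.
    Sorting is arbitrary; any fixed order within each group works."""
    read_order = sorted(cur - prev) + sorted(cur & prev)
    write_order = sorted(cur & nxt) + sorted(cur - nxt)
    return read_order, write_order

# Hypothetical layer sharing {3, 4} with its predecessor
# and {1, 4} with its successor.
print(uop_orders({1, 2, 3, 4}, prev={3, 4}, nxt={1, 4}))
```

Note that, unlike EOP and ROP, the two orders returned here are independent, which is precisely why UOP needs the fewest idle cycles.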

### 4.5. Decoding of Irregular Codes

A serial processor cannot process consecutive layers with decreasing degrees, , as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is just another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with idle cycles.

Since this type of hazard is independent of that seen above, the same idle cycles may help to solve both issues. For this reason, the overall number of idle cycles becomes

with being computed according to (6), (11), or (13).
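Since equation (14) is not reproduced in this excerpt, the sketch below encodes one plausible reading of the combination rule (an assumption, flagged as such): because the two hazards are independent and any inserted idle cycle counts against both, only the larger of the two requirements matters.

```python
def total_idle(hazard_idle, d_cur, d_next):
    """Idle cycles covering both hazard types at once (assumed max
    rule). `hazard_idle` is the data-dependency idle count from the
    EOP/ROP/UOP analysis; a degree drop of d_cur - d_next cycles must
    also be absorbed, but the same idle cycles serve both purposes."""
    degree_idle = max(0, d_cur - d_next)
    return max(hazard_idle, degree_idle)

# Hypothetical values: 3 hazard cycles already cover a degree drop of 2.
print(total_idle(3, d_cur=8, d_next=6))  # 3
```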

### 4.6. Optimal Sequence of Layers

For a given reordering strategy, the overall number of idle cycles per decoding iteration is a function of the actual sequence of layers used for decoding. For a code with layers, the optimal sequence of layers minimizing the time spent in idle is given by

where is the number of idle cycles between layer and for the generic permutation and is given by (14), and is the set of the possible permutations of layers.

The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix , whose layers are scrambled according to the optimal permutation . Then, within each layer of , the order to update the nonnull subblocks is given by the strategy in use among EOP, ROP, and UOP.
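The brute-force search in (15) can be sketched directly (a minimal illustration: `idle_between` stands in for the per-pair idle cost of (14), and for codes with many layers the factorial search space would call for pruning, which is not shown):

```python
from itertools import permutations

def best_layer_sequence(n_layers, idle_between):
    """Exhaustive search over the n! update sequences for the one
    minimising the total idle cycles per iteration. `idle_between(a, b)`
    gives the idle cycles required between layers a and b; the pair
    (last, first) is included because decoding iterates cyclically."""
    def cost(seq):
        return sum(idle_between(seq[i], seq[(i + 1) % len(seq)])
                   for i in range(len(seq)))
    return min(permutations(range(n_layers)), key=cost)

# Hypothetical pairwise idle costs for a 3-layer code.
M = [[0, 5, 1],
     [1, 0, 5],
     [5, 1, 0]]
print(best_layer_sequence(3, lambda a, b: M[a][b]))
```

The returned tuple plays the role of the optimal permutation that defines the scrambled parity-check matrix, after which EOP, ROP, or UOP fixes the update order within each layer.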

### 4.7. Summary and Results

The three methods proposed in this section differ in how effectively they minimize the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the LDPC code considered, and ROP and EOP can come very close to UOP. As a case example, results will be shown in Section 7 for the WLAN LDPC codes.

However, the effectiveness of the individual methods must be weighed against the requirements of the underlying decoder architecture and the costs of its hardware implementation, which are the objective of Section 5. In particular, UOP generally demands greater hardware complexity, so EOP or ROP can be preferable for particular codes.