The data-flow of a pipelined decoder with serial processing units is sketched in Figure 2. A centralized memory unit keeps the updated soft-outputs, computed by the node processors (NPs) according to Algorithm 1. If we denote with
the number of nonnull blocks in layer
, that is, the degree of layer
, then the processor takes
clock cycles to serially load its inputs. Then, refined values are written back in memory (after scrambling or permutation) with the latency of
clock cycles, and this operation takes again
clock cycles. Overall, the processing time of layer
is then
clock cycles, as shown in Figure 3(a).
If the decoder works in pipeline, time is saved by overlapping the phases of elaboration, writing-out and reading, so that data are continuously read from and written into memory, and a new layer is processed every
clock cycles (see Figure 3(b)).
Although highly desirable, the pipeline mechanism is particularly challenging in a layered LDPC decoder, since the soft-outputs retrieved from memory and used for the current elaboration could not be always up-to-date, but newer values could be still in the pipeline. This issue, known as pipeline hazard, prevents the use and so the propagation of always up-to-date messages and spoils the error-correction performance of the decoding algorithm.
The solution investigated in this paper is to insert null or idle cycles between consecutive updates, so that a node processor is suspended to wait for newer data. The number of idle cycles must be kept as small as possible since it affects the iteration time and so the decoding throughput. Its value depends on the actual sequence of layers updated by the decoder as well as on the order followed to update messages within a layer.
Three different strategies are described in this section, to reduce the dependence between consecutive updates in the HLD algorithm and, accordingly, the number of idle cycles. These differ in the order followed for acquisition and writing-out of the decoding messages and constitute a powerful tool for the design of "layered'', hazard-free, LDPC codes.
4.1. System Notation
Without any lack of generality, let us identify a layer with one single parity-check node and focusing on the set
of soft-outputs participating to layer
, let us define the following subsets:
(i)
, the set of SOs in common with layer
;
(ii)
, the set of SOs in common with layer
and not in
;
(iii)
, the set of SOs in common with both layers
and
;
(iv)
, the set of SOs in common with layer
and not in
or
;
(v)
, the set of SOs in common with layer
but not in
,
,
;
(vi)
, the set of SOs in common with both layers
and
, but not in
or
;
(vii)
, the set of remaining SOs.
In the definitions above the notation
means the relative complement of
in
or the set-theoretic difference of
and
. Let us also define the following cardinalities:
(degree of layer
),
,
,
,
,
,
,
.
4.2. Equal Output Processing
First, let us consider a very straightforward and implementation friendly architecture of the node processor that updates (and so delivers) the soft-output messages with the same order used to take them in.
In such a case it would be desirable to (i) postpone the acquisition of messages updated by the previous layer, that is, messages in
, and (ii) output the messages in
as soon as possible to let the next layer start earlier. Actually, the last constraint only holds when
does not include any message common to layer
, that is, when
; otherwise, the set
could be acquired at any time before
.
Figure 4 shows the I/O data streams of an equal output processing (EOP) unit. Here,
is the latency of the SO data-path, including the elaboration in the NP, the scrambling, and the two memory accesses (reading and writing). Focusing on layer
, the set
cannot be assigned to any specific position within
, since the whole
is acquired according to the same order used by layer
to output (and so also acquire) the sets
and
. For this reason, the situation plotted in Figure 4 is only for the sake of a clearer drawing.
With reference to Figure 4, pipeline hazards are cleared if
idle cycles are spent between layer
and
so that
with
for
and
otherwise. This means that if
is empty, then the messages in
do not need to be waited for. The solution to (5) with minimum latency is
Note that (5) and (6) only hold under the hypothesis of
leading within
. If this is not the case, up to
extra idle cycles could be added if
is output last within
.
So far, we have only focused on the interaction between two consecutive layers; however, violations could also arise between layer
and
. Despite this possibility, this issue is not treated here, as it is typically mitigated by the same idle cycles already inserted between layers
and
and between layers
and
.
4.3. Reversed Output Processing
Depending on the particular structure of the parity-check matrix
, it may occur that the most of the messages of layer
in common with layer
are also shared with layer
, that is,
and
. If this condition holds, as for the WLAN LDPC codes (see Figure 11), it can be worth reversing the output order of SOs so that the messages in
can be both acquired last and output first.
Figure 5(a) shows the I/O streams of a reversed output processing (ROP) unit. Exploiting the reversal mechanism, the set
is acquired second-last, just before
, so that it is available earlier for layer
.
Following a reasoning similar to EOP, the situation sketched in Figure 5(a) where
is delivered first within
is just for an easier representation, and the condition for hazard-free layered decoding is now
Indeed, when
, one could output
first in
, and so get rid of the term
. However, since
is actually left floating within
, (7) represents again a best-case scenario, and up to
extra idle cycles could be required. From (7), the minimum latency solution is
Similarly to EOP, the ROP strategy also suffers from pipeline hazards between three consecutive layers, and because of the reversed output order, the issue is more relevant now. This situation is sketched in Figure 5(b), where the sets
,
, and
are managed similarly to
,
and
. The ROP strategy is then instructed to acquire the set
later and to output
earlier. However, the situation is complicated by the fact that the set
may not entirely coincide with
; rather it is
, since some of the messages in
can be found in
. This is highlighted in Figure 5(b), where those messages of
and
not delivered to
are shown in dark grey.
To clear the hazards between three layers, additional idle cycles are added in the number of
where
is the acquisition margin on layer
, and
is the writing-out margin on layer
. These can be computed under the assumption of no hazard between layer
and
(i.e.,
is aligned with
thanks to
as shown in Figure 5(b)) and are given by
The margin
is actually nonnull only if
; otherwise,
under the hypothesis that (i) the set
is output first within
, and (ii) within
, the messages not in
are output last.
Overall, the number of idle cycles of ROP is given by
4.4. Unconstrained Output Processing
Fewer idle cycles are expected if the orders used for input and output are not constrained to each other. This implies that layer
can still delay the acquisition of the messages updated by layer
(i.e., messages in
) as usual, but at the same time the messages common to layer
(i.e., in
) can also be delivered earlier.
The input and output data streams of an unconstrained output processing (UOP) unit are shown in Figure 6. Now, hazard-free layered decoding is achieved when
which yields
Regarding the interaction between three consecutive layers, if the messages common to layer
(i.e., in
) are output just after
, and if on layer
, the set
is taken just before
, then there is no risk of pipeline hazard between layer
and
.
4.5. Decoding of Irregular Codes
A serial processor cannot process consecutive layers with decreasing degrees,
, as the pipeline of the internal elaborations would be corrupted and the output messages of the two layers would overlap in time. This is not but another kind of pipeline hazard, and again, it can be solved by delaying the update of the second layer with
idle cycles.
Since this type of hazard is independent of that seen above, the same idle cycles may help to solve both issues. For this reason, the overall number of idle cycles becomes
with
being computed according to (6), (11), or (13).
4.6. Optimal Sequence of Layers
For a given reordering strategy, the overall number of idle cycles per decoding iteration is a function of the actual sequence of layers used for the decoding. For a code with
layers, the optimal sequence of layer
minimizing the time spent in idle is given by
where
is the number of idle cycles between layer
and
for the generic permutation
and is given by (14), and
is the set of the possible permutations of layers.
The minimization problem in (15) can be solved by means of a brute-force computer search and results in the definition of a permuted parity-check matrix
, whose layers are scrambled according to the optimal permutation
. Then, within each layer of
, the order to update the nonnull subblocks is given by the strategy in use among EOP, ROP, and UOP.
4.7. Summary and Results
The three methods proposed in this section are differently effective to minimize the overall time spent in idle. Although UOP is expected to yield the smallest latency, the results strongly depend on the considered LDPC code, and ROP and EOP can be very close to UOP. As a case-example, results will be shown in Section 7 for the WLAN LDPC codes.
However, the effectiveness of the individual methods must be weighed up in view of the requirements of the underlying decoder architecture and the costs of its hardware implementation, which is the objective of Section 5. Thus, UOP generally requires bigger complexity in hardware, and EOP or ROP can be preferred for particular codes.