This section describes our first hardware implementation of MT19937, which we call the single port version. Random number generation is carried out in three stages: the seed generator, the seed value modulator, and the output generator. This is illustrated in Figure 1.
Typically the user provides a single number as a seed; however, the MT19937 algorithm works with a pool of 624 seeds, so the seed generator stage expands the single user input into 624 seeds. In stage two (the seed value modulator), which is the core of the algorithm, three values, seed[k], seed[k+1], and seed[k+397] (indices taken modulo 624), are read from the pool and, based on the computation defined in the algorithm, seed[k] is updated. In the final stage, the output generator reads one of the pool values and generates the output uniform random number from this value.
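The three stages can be pictured with a short software model. The sketch below follows the standard MT19937 reference recurrence and constants; the function names (`seed_generator`, `modulate`, `temper`, `generate`) and the `MAG` table are illustrative choices, not names taken from our hardware design, and the code says nothing about how the logic is laid out on the FPGA.

```python
N, M = 624, 397                   # pool size and recurrence offset of MT19937
UPPER, LOWER = 0x80000000, 0x7FFFFFFF
MAG = (0x00000000, 0x9908B0DF)    # constants selected by the low bit of y
MASK32 = 0xFFFFFFFF


def seed_generator(seed):
    """Stage 1: expand one user seed into a pool of 624 32-bit seeds."""
    pool = [seed & MASK32]
    for i in range(1, N):
        prev = pool[-1]
        pool.append((1812433253 * (prev ^ (prev >> 30)) + i) & MASK32)
    return pool


def modulate(pool, k):
    """Stage 2: read pool[k], pool[k+1], pool[k+397] and update pool[k]."""
    y = (pool[k] & UPPER) | (pool[(k + 1) % N] & LOWER)
    pool[k] = pool[(k + M) % N] ^ (y >> 1) ^ MAG[y & 1]


def temper(y):
    """Stage 3: turn one pool value into an output random number."""
    y ^= y >> 11
    y ^= (y << 7) & 0x9D2C5680
    y ^= (y << 15) & 0xEFC60000
    y ^= y >> 18
    return y & MASK32


def generate(seed, count):
    """Produce `count` outputs, updating one pool entry per output."""
    pool = seed_generator(seed)
    out = []
    for n in range(count):
        k = n % N
        modulate(pool, k)         # stage 2 updates entry k in place
        out.append(temper(pool[k]))  # stage 3 reads the freshly updated entry
    return out
```

Under these assumptions, `generate(5489, 1)` returns `[3499211612]`, the first output of the reference generator for seed 5489. Updating one entry per output, rather than regenerating the whole pool at once as the reference software does, yields the same sequence, because each entry is read only in its pre-update or post-update state exactly as in the batch version.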
The logic used to generate values in stages 2 and 3 is shown in Figure 2. The simplest form of parallelism for MT19937 is to run stages 2 and 3 concurrently, as the figure illustrates. Note that the output generator cannot be pipelined more finely, because its processing rate is tied to that of the seed value modulator, which can only be pipelined into 3 stages. In other words, the seed value modulator is the bottleneck of the design. When the data comes from a dual port BRAM, only one value can be read and one written in the same clock cycle. Since three values must be read, we use 3 dual port BRAMs, and logic is then needed to decide which BRAM to write into. This write-back selection logic forms another stage in the seed value modulator, which now has 4 stages. Not shown in Figure 1 is the logic that generates the BRAM read and write addresses. The single port version generates one new random number per clock cycle. In Figure 2, mag1, mag2, and the hex numbers are constants given in the algorithm definition.
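The port constraint behind the three-BRAM choice can be illustrated with a toy model. The class `SimpleDualPortBRAM` and the helper `reads_in_one_cycle` below are hypothetical names introduced only for this sketch; they assume the one-read, one-write-per-cycle behaviour stated above and do not model how the seeds are distributed across the BRAMs or the write-back selection logic.

```python
class SimpleDualPortBRAM:
    """Toy memory: at most one read and one write per clock cycle (assumed)."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.reads = 0
        self.writes = 0

    def tick(self):
        """Start a new clock cycle: both ports become available again."""
        self.reads = 0
        self.writes = 0

    def read(self, addr):
        if self.reads >= 1:
            raise RuntimeError("read port already used this cycle")
        self.reads += 1
        return self.mem[addr]

    def write(self, addr, value):
        if self.writes >= 1:
            raise RuntimeError("write port already used this cycle")
        self.writes += 1
        self.mem[addr] = value


def reads_in_one_cycle(brams, addrs):
    """Try one read per (bram, addr) pair within a single cycle."""
    for bram in brams:
        bram.tick()               # idempotent, so repeated BRAMs are fine
    try:
        for bram, addr in zip(brams, addrs):
            bram.read(addr)
    except RuntimeError:
        return False
    return True


# Three reads from one BRAM do not fit in a cycle; one read from each of three do.
single = SimpleDualPortBRAM(624)
three = [SimpleDualPortBRAM(624) for _ in range(3)]
single_ok = reads_in_one_cycle([single, single, single], [0, 1, 397])
three_ok = reads_in_one_cycle(three, [0, 1, 397])
```

Here `single_ok` is `False` and `three_ok` is `True`, which is the reason the seed value modulator needs three read sources to sustain one update per clock cycle.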
The single port version is similar to the software implementation of the MT19937 algorithm in that it does not significantly parallelize the generation of the seeds. The only parallelism exploited is the concurrent execution of the seed value modulator (stage 2) and the output generator (stage 3). We also found that the output generator could not be pipelined into more than 3 stages, as it is tied to the seed value modulator. Throughput could be improved significantly by parallelizing each of stages 2 and 3, in addition to executing the two stages concurrently as shown above. The problem with parallelizing stages 2 and 3, however, is that the seeds are currently all stored in a single dual port BRAM, and it is not possible to carry out multiple reads and multiple writes to a single BRAM in one clock cycle. Previously, in [7], both stages were parallelized by dividing the seeds among multiple BRAMs. This, however, significantly increased the area requirements of the design. In the next section we study this problem in more detail and present our new design, which achieves high throughput and is area efficient.