In this section, we describe an implementation of the backprojection image formation algorithm on a high-performance reconfigurable computer. Our implementation has been designed to provide high-speed image formation services and support output data distribution via a publish/subscribe [25] methodology. Section 4.1 describes the system on which our implementation runs. Section 4.2 explores the inherent parallelism in backprojection and describes the high-level design decisions that steered the implementation. Section 4.3 describes the portion of the implementation that runs in software, and Section 4.4 describes the hardware.
4.1. System Background
In Section 2.2, we described the HHPC system. In this section, we will explore more deeply the aspects of that system that are relevant to our experimental design.
4.1.1. HHPC Features
Several features of the Annapolis WildStar II FPGA boards are directly relevant to the design of our backprojection implementation. In particular, the host-to-FPGA interface, the on-board memory bandwidth, and the available features of the FPGA itself guided our design decisions.
Communication between the host GPP and the WildStar II board is over a PCI bus. The HHPC provides a PCI bus that runs at 66 MHz with 64-bit datawords. The WildStar II on-board PCI interface translates this into a 32-bit interface running at 133 MHz. When DMA data transfers are used to communicate between the GPP and the FPGA, the on-board PCI interface performs this translation transparently and without significant loss of performance. A 133 MHz clock is also a good and achievable clock rate for FPGA hardware, so most of the hardware design can be run directly off the PCI interface clock. This simplifies the design, since fewer clock domains are needed (see Section 4.4.1).
The WildStar II board has six on-board SRAM memories (1 MB each) and one SDRAM memory (64 MB). It is beneficial to be able to read one datum and write one datum in the same clock cycle, so we prefer to use multiple SRAMs instead of the single larger SDRAM. The SRAMs run at 50 MHz and feature a 32-bit dataword (plus four parity bits), but they use a DDR interface. The Annapolis controller for the SRAM translates this into a 50 MHz 72-bit interface. Both features are separately important: we will need to cross from the 50 MHz memory clock domain to the 133 MHz PCI clock domain, and we will need to choose the size of our data such that they can be packed into a 72-bit memory word (see Section 4.2.4).
Finally, the Virtex-II 6000 FPGA on the WildStar II has some useful features that we use to our advantage. A large amount of on-chip memory is available in the form of BlockRAMs, which are configurable in width and depth but can hold at most 2 KB of data each. One hundred forty-four of these dual-ported memories are available, and each can be accessed independently. This makes BlockRAMs a good candidate for storing and accessing input projection data (see Sections 4.2.4 and 4.4.3). BlockRAMs can also be configured as FIFOs and, owing to their dual-ported nature, can be used to cross clock domains.
4.1.2. Swathbuckler Project
This project was designed to fit in as part of the Swathbuckler project [26–28], an implementation of synthetic aperture radar created by a joint program between the American, British, Canadian, and Australian defense research project agencies. It encompasses the entire SAR process including the aircraft and radar dish, signal capture and analog-to-digital conversion, filtering, and image formation hardware and software.
Our problem as posed was to increase the processing capability of the HHPC by improving the performance of the portions of the application shown on the right-hand side of Figure 1. Given that a significant amount of work had gone into tuning the performance of the software implementation of the filtering process [26], it remained for us to improve the speed at which images could be formed. According to the project specification, the input data are streamed into the microprocessor main memory. In order to perform image formation on the FPGA, it is then necessary to copy data from the host to the FPGA. Likewise, the output image must be copied from the FPGA memory to the host memory so that it can be made accessible to the publish/subscribe software. These data transfer times are included in our performance measurements (see Section 5).
4.2. Algorithm Analysis
In this section, we dissect the backprojection algorithm with an eye toward implementing it on an HPRC machine. There are many factors that need to be taken into account when designing an HPRC application. First and foremost, an application that does not have a high degree of parallelism is generally not a good candidate. Given a suitable application, we then decide how to divide the problem along the available levels of parallelism in order to determine what part of the application will be executed on each available processor. This includes GPP/FPGA assignment as well as dividing the problem across the multiple nodes of the cluster. For the portions of the application run on the FPGAs, data arrays must be distributed among the accessible memories. Next, we look at some factors to improve the performance of the hardware implementation, namely, data formats and computation strength reduction. We conclude by examining the parameters of the data collection process that affect the computation.
4.2.1. Parallelism Analysis
In any reconfigurable application design, performance gains due to implementation in hardware inevitably come from the ability of reconfigurable hardware (and, indeed, hardware in general) to perform multiple operations at once. Extracting the parallelism in an application is thus critical to a high-performance implementation.
Equation (1) shows the backprojection operation in terms of the projection data p(τ, t) and an output image f(x, y), where τ and t are the slow-time (pulse) and fast-time (range sample) indices and x and y are the azimuth and range coordinates of a pixel. That equation may be interpreted to say that, for a particular pixel f(x, y), the final value can be found from a summation of contributions from the set of all projections whose corresponding radar pulse covered that ground location. The fast-time index t contributed by a given projection to a given pixel is determined by the mapping function in (2). There is a large degree of parallelism inherent in this interpretation.

(1) The contribution from one projection to a pixel f(x, y) is not dependent on the contributions from any other projection to that same pixel.

(2) The contribution from one projection to a pixel f(x, y) is not dependent on the contribution from that projection to any other pixel.

(3) The final value of a pixel is not dependent on the value of any other pixel in the target image.
It can be said, therefore, that backprojection is an "embarrassingly parallel" application, which is to say that it lacks any data dependencies. Without data dependencies, the opportunity for parallelism is vast and it is simply a matter of choosing the dimensions along which to divide the computation that best matches the system on which the algorithm will be implemented.
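To make this interpretation concrete, the following C fragment is a minimal software reference model of backprojection; the helpers time_index() and in_beam() stand in for the mapping function of (2) and the beam test, and all names here are ours rather than Swathbuckler's. Every iteration of every loop is independent of the others, which is the property exploited in the rest of this section.

#include <complex.h>
#include <stddef.h>

typedef float complex cpx;

/* Stand-in for the mapping function of (2): pixel (x, y) and pulse tau
 * map to a fast-time sample index. */
extern size_t time_index(size_t x, size_t y, size_t tau);
/* Returns nonzero if pixel (x, y) falls inside the beam of pulse tau. */
extern int in_beam(size_t x, size_t y, size_t tau);

void backproject(cpx *image, size_t nx, size_t ny,
                 const cpx *proj, size_t num_pulses, size_t num_samples)
{
    for (size_t x = 0; x < nx; x++) {             /* each pixel is independent */
        for (size_t y = 0; y < ny; y++) {
            cpx acc = 0;
            for (size_t tau = 0; tau < num_pulses; tau++) {  /* each pulse is independent */
                if (!in_beam(x, y, tau))
                    continue;
                size_t t = time_index(x, y, tau);
                if (t < num_samples)
                    acc += proj[tau * num_samples + t];      /* summation of (1) */
            }
            image[x * ny + y] = acc;
        }
    }
}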
4.2.2. Dividing the Problem
There are two ways in which parallel applications are generally divided across the nodes of a cluster.
(1) Split the data. In this case, each node performs the same computation as every other node, but on a subset of the data. There may be several different ways that the data can be divided.

(2) Split the computation. In this case, each node performs a portion of the computation on the entire dataset. Intermediate sets of data flow from one node to the next. This method is also known as task-parallel or systolic computing.
While certain supercomputer networks may make the task-parallel model attractive, our work with the HHPC indicates that its architecture is better suited to the data-parallel model. Since internode communication is accomplished over a many-to-many network (Ethernet or Myrinet), passing data from one node to the next as implied by the task-parallel model would potentially hurt performance. A task-parallel design also implies that a new FPGA design must be created for each FPGA node in the system, greatly increasing design and verification time. Finally, the number of tasks available in this application is relatively small and would not occupy the number of nodes that are available to us.
Given that we will create a data-parallel design, there are several axes along which we considered splitting the data. One method involves dividing the input projection data p(τ, t) among the nodes along the slow-time (τ) dimension. Each node would hold a subset of the projections and calculate that subset's contribution to the final image. However, this implies that each node must hold a copy of the entire target image in memory and, furthermore, that all of the partial target images would need to be added together after processing before the final image could be created. This extra processing step would also require a large amount of data to pass between nodes. In addition, the size of the final image would be limited to that which could fit on a single FPGA board.

Rather than dividing the input data, the preferred method divides the output image f(x, y) into pieces along the range (y) axis (see Figure 2). In theory, this requires that every projection be sent to every node; however, since only a portion of each projection affects the slice of the final image being computed on a single node, only that portion must be sent to that node. Thus, the amount of input data sent to each node is reduced to the band of fast-time samples that covers that node's slice of the image. We refer to the portion of the final target image being computed on a single node as a "subimage".
Figure 2 shows that the largest fast-time index needed by a node, t_max, lies slightly beyond the index that corresponds to the far edge of its subimage. This is due to the width of the cone-shaped radar beam. The dotted line in the figure shows a single radar pulse taken at one slow-time index τ. The minimum distance to any part of the subimage is to the point on the near edge of the subimage directly broadside of the radar, and it corresponds to fast-time index t_min in the projection data. The maximum distance to any part of the subimage, however, is measured along the outer edge of the beam cone to the far corner of the subimage; it exceeds the far-edge range by a factor calculated from the beamwidth angle of the radar and the far-edge range itself. Thus, the fast-time index t_max is calculated relative to that beamwidth factor and the far-edge range rather than the far-edge range alone. This also implies that the fast-time ranges for two adjacent nodes will overlap somewhat or, equivalently, that some projection data will be sent to more than one node.
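As a concrete sketch of this partitioning, the following C fragment shows how a node might derive its subimage extent from its rank and widen the corresponding fast-time window by a beamwidth-dependent margin. The geometry is simplified (flat scene, range measured from the flight path) and all names are ours, so the actual Swathbuckler bookkeeping may differ.

#include <math.h>
#include <stddef.h>

/* Hypothetical per-node slice computation for the range (y) split. */
typedef struct {
    double y_min, y_max;   /* ground-range extent of this node's subimage (m) */
    size_t t_min, t_max;   /* fast-time sample window that must be sent here  */
} node_slice;

node_slice compute_slice(int rank, int num_nodes,
                         double r_min,         /* range to near edge of full image (m) */
                         double image_depth,   /* full image extent in range (m)       */
                         double sigma_r,       /* range distance between samples (m)   */
                         double tan_half_beam) /* tan(beamwidth / 2), from host code   */
{
    node_slice s;
    double slice = image_depth / num_nodes;
    s.y_min = r_min + rank * slice;
    s.y_max = s.y_min + slice;

    /* Nearest possible echo: broadside to the near edge of the slice. */
    s.t_min = (size_t)floor(s.y_min / sigma_r);
    /* Farthest possible echo: along the edge of the beam cone to the far
     * edge of the slice, slightly beyond y_max / sigma_r; this margin is
     * why adjacent nodes receive overlapping projection data. */
    double r_far = s.y_max * sqrt(1.0 + tan_half_beam * tan_half_beam);
    s.t_max = (size_t)ceil(r_far / sigma_r);
    return s;
}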
Since the final value of a pixel does not depend on the values of the pixels surrounding it, each FPGA needs to hold only the subimage that it is responsible for computing. That portion is not affected by the results on any other FPGA, which means that the postprocessing accumulation stage can be avoided. If a larger target image is desired, subimages can be "stitched" together simply by concatenation.
In contrast to the method in which the input data are divided along the slow-time dimension, the size of the final target image is not restricted by the amount of memory on a single node; furthermore, larger images can be processed by adding nodes to the cluster. This is commonly referred to as coarse-grained parallelism, since the problem has been divided into large-scale independent units. Coarse-grained parallelism is directly related to the performance gains that are achieved by adapting the application from a single-node computer to a multinode cluster.
4.2.3. Memory Allocation
The memory devices used to store the input and output data on the FPGA board may now be determined. We need to store two large arrays of information: the target image f(x, y) and the input projection data p(τ, t). On the WildStar II board, there are three options: an on-board DRAM, six on-board SRAMs, and a variable number of BlockRAMs, which reside inside the FPGA and can be instantiated as needed. The on-board DRAM has the highest capacity (64 MB) but is the most difficult to use and has only one read/write port. BlockRAMs are the most flexible (two read/write ports and a configurable geometry) and the simplest to use, but they have a small (2 KB) capacity.
For the target image, we would like to be able to both read and write one target pixel per cycle. It is also important that the size of the target image stored on one node be as large as possible, so memories with larger capacity are better. Thus, we will use multiple on-board SRAMs to store the target image. By implementing a two-memory storage system, we can provide two logical ports into the target image array. During any given processing step, one SRAM acts as the source for target pixels, and the other acts as the destination for the newly computed pixel values. When the next set of projections is sent to the FPGA, the roles of the two SRAMs are reversed.
Owing to the 1 MB size of the SRAMs in which we store the target image data, the number of pixels that can be held on a node is limited. We choose to arrange the target image as 1024 pixels in the azimuth dimension by 512 pixels in the range dimension. Using power-of-two dimensions allows us to maximize our use of the SRAM, and keeping the range dimension small allows us to reduce the amount of projection data that must be transferred.
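The alternating roles of the two SRAMs can be summarized with a small C model (purely illustrative; the real control logic is in VHDL): one buffer is read as the source of partially accumulated pixels while the other receives the updated values, and the two are swapped before the next batch of projections.

#include <stddef.h>
#include <stdint.h>

/* process_batch() stands in for the hardware pipeline: it reads each
 * partially accumulated pixel from src, adds the contributions of the
 * current batch of projections, and writes the result to dst. */
extern void process_batch(const int64_t *src, int64_t *dst,
                          const void *projection_batch);

/* Returns a pointer to the buffer that holds the finished subimage. */
int64_t *accumulate_all(int64_t *sram_a, int64_t *sram_b,
                        const void **batches, size_t num_batches)
{
    int64_t *src = sram_a, *dst = sram_b;
    for (size_t i = 0; i < num_batches; i++) {
        process_batch(src, dst, batches[i]);
        int64_t *tmp = src;   /* swap source and destination for the next batch */
        src = dst;
        dst = tmp;
    }
    return src;               /* the last batch was written here */
}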
For the projection data, we would like to have many small memories that can each feed one of the projection adder units. BlockRAMs allow us to instantiate multiple small memories in which to hold the projection data; each memory has two available ports, meaning that two adders can be supported in parallel. Each adder reads from one SRAM and writes to another; since we can support two adders, we could potentially use four SRAMs.
4.2.4. Data Formats
Backprojection is generally accomplished in software using a complex (i.e., real and imaginary parts) floating-point format. However, since the result of this application is an image which requires only values from 0 to 255 (i.e., 8-bit integers), the loss of precision inherent in transforming the data to a fixed-point/integer format is negligible. In addition, using an integer data format allows for much simpler functional units.
Given an integer data format, it remains to determine how wide the various datawords should be. We base our decision on the word width of the memories. The SRAM interface provides 72 bits of data per cycle, comprising two physical 32-bit datawords plus four parity bits each. The BlockRAMs are configurable but generally provide power-of-two-sized datawords.
Since backprojection is in essence an accumulation operation, it makes sense for the output data (target image pixels) to be wider than the input data (projection samples). This reduces the likelihood of overflow error in the accumulation. We, therefore, use 36-bit complex integers (18-bit real and 18-bit imaginary) for the target image, and 32-bit complex integers for the projection data.
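To make the formats concrete, the following C sketch shows one way the host-side conversion and packing could be done. The 16-bit split of the 32-bit projection samples, the scale factor, and the helper names are our assumptions, and the parity bits added by the memory controller are ignored.

#include <complex.h>
#include <math.h>
#include <stdint.h>

/* Saturate a double to a signed integer with 'bits' bits. */
static int32_t saturate(double v, int bits)
{
    double lim = (double)(1 << (bits - 1)) - 1.0;
    if (v > lim)  v = lim;
    if (v < -lim) v = -lim;
    return (int32_t)lround(v);
}

/* Projection sample: 32-bit complex integer, assumed here to be a
 * 16-bit real and a 16-bit imaginary half. */
uint32_t pack_projection_sample(float complex s, double scale)
{
    uint32_t re = (uint16_t)saturate(crealf(s) * scale, 16);
    uint32_t im = (uint16_t)saturate(cimagf(s) * scale, 16);
    return (re << 16) | im;
}

/* Target pixel: 36-bit complex integer (18-bit real, 18-bit imaginary),
 * held here in the low 36 bits of a 64-bit word; the hardware packs two
 * such pixels into one 72-bit SRAM word. */
uint64_t pack_target_pixel(int32_t re18, int32_t im18)
{
    return (((uint64_t)re18 & 0x3FFFF) << 18) | ((uint64_t)im18 & 0x3FFFF);
}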
After backprojection, a complex magnitude operator is needed to reduce the 36-bit complex integers to a single 18-bit real integer. This operator is implemented in hardware, but the process of scaling data from 18-bit integer to 8-bit image is left to the software running on the GPP.
4.2.5. Computation Analysis
The computation to be performed on each node consists of three parts. The summation from (1) and the distance calculation from (2) represent the backprojection work to be done. The complex magnitude operation is similar to the distance calculation.
While adders are simple to replicate in large numbers, the hardware required to perform multiplication and square root is more costly. If we were using floating-point data formats, the number of functional units that could be instantiated would be very small, reducing the parallelism that we can exploit. With integer data types, however, these units are relatively small, fast, and easily pipelined. This allows us to maintain a high clock rate and one-result-per-cycle throughput.
4.2.6. Data Collection Parameters
The conditions under which the projection data are collected affect certain aspects of the backprojection computation. In particular, the spacing between samples in the projection data array and the spacing between pixels in the target image array imply constant factors that must be accounted for during the distance-to-time index calculation (see Section 4.4.3).
For the input data, the azimuth sample spacing σ_a indicates the distance (in meters) between samples in the azimuth dimension. This is equivalent to the distance that the plane travels between outgoing pulses of the radar. Often, due to imperfect flight paths, this value is not regular. The data filtering that occurs prior to backprojection image formation is responsible for correcting for inaccuracies due to the actual flight path, so that a regular spacing can be assumed.

As the reflected radar data are observed by the radar receiver, they are sampled at a particular frequency f_s. That frequency translates to a range distance between samples of σ_r = c/(2 f_s), where c is the speed of light. The factor of 2 accounts for the fact that the radar pulse travels the intervening distance, is reflected, and travels the same distance back. Because the airplane is not flying at ground level, an additional angle of elevation is included to determine a more accurate value for σ_r.
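As a small illustration, the range sample spacing can be computed directly from the sampling rate; the elevation correction below is only schematic, since the exact geometry handling is not described here.

#include <math.h>

#define SPEED_OF_LIGHT 299792458.0   /* m/s */

/* Slant-range distance between consecutive fast-time samples.  The factor
 * of 2 accounts for the two-way travel of the radar pulse. */
double range_sample_spacing(double sample_rate_hz)
{
    return SPEED_OF_LIGHT / (2.0 * sample_rate_hz);
}

/* Approximate ground-range spacing, assuming a flat scene and a constant
 * elevation angle; the real correction depends on the collection geometry. */
double ground_sample_spacing(double sample_rate_hz, double elevation_rad)
{
    return range_sample_spacing(sample_rate_hz) / cos(elevation_rad);
}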
For the target image (output data), the pixel spacings δ_r and δ_a simply correspond to the real distance between pixels in the range and azimuth dimensions, respectively. In general, δ_r and δ_a are not necessarily related to σ_r and σ_a and can be chosen at will. In practice, setting δ_a = σ_a makes the algorithm computation more regular (and thus more easily parallelizable). Likewise, setting δ_r = σ_r reduces the need for interpolation between samples in the fast-time dimension, since most samples will then line up with pixels in the range dimension. Finally, setting δ_a = δ_r provides square pixels and an easier-to-read aspect ratio in the output image.
The final important parameter is the minimum range from the radar to the near edge of the target image, known as R_min. Together with the range extent of each node's subimage, it is used by the software to determine what portion of the projection data is applicable to a particular node.
4.3. Software Design
We now describe the HPRC implementation of backprojection. As with most FPGA-based applications, the work that makes up the application is divided between the host GPP and the FPGA. In this section, we will discuss the work done on the GPP; in Section 4.4, we continue with the hardware implemented on the FPGA.
The main executable running on the GPP begins by using the MPI library to spawn processes on several of the HHPC nodes. Once all MPI jobs have started, the host code configures the FPGA with the current values of the flight parameters from Section 4.2.6. In particular, the sample and pixel spacings and R_min (the minimum range) are sent to the FPGA. However, in order to avoid the use of fractional numbers, all of these parameters are normalized such that σ_r = 1. This allows the hardware computation to be carried out in terms of fast-time indices into the projection data rather than ground distances.
Next, the radar data are read. In the Swathbuckler system, these input data would be streamed directly into memory and no separate "read" step would be required. Since we are not able to integrate directly with Swathbuckler, our host code reads the data from a file on the shared disk. These data are translated from complex floating-point format to integers. The host code also determines the range of fast-time samples that is relevant to the subimage being calculated by this node (see Section 4.2.2).
The host code then loops over the slow-time (τ) domain of the projection data. A chunk of the data is sent to the FPGA and processed. The host code waits until the FPGA signals that processing is complete and then transmits the next chunk of data. When all projection data have been processed, the host code requests that the final target image be sent from the FPGA. The pixels of the target image are scaled and rearranged into an image buffer, and an image file is optionally produced using the GTK+ library [29].
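A condensed sketch of this control flow is shown below. The fpga_* functions are hypothetical wrappers standing in for the board-specific PIO and DMA calls of Section 4.4.2; they are not the Annapolis API, and the register index and chunk bookkeeping are illustrative only.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers around the board API; not the real Annapolis calls. */
extern void fpga_pio_write(int reg, uint32_t value);        /* control registers */
extern void fpga_dma_write(const void *buf, size_t bytes);  /* host -> FPGA      */
extern void fpga_dma_read(void *buf, size_t bytes);         /* FPGA -> host      */
extern void fpga_wait_done(void);                           /* poll "step done"  */

#define PIXELS_PER_IMAGE (1024 * 512)

void form_subimage(const uint32_t *proj, size_t num_pulses, size_t samples_per_pulse,
                   size_t pulses_per_chunk, uint32_t *image_out)
{
    /* Loop over the slow-time domain, one chunk of projections at a time. */
    for (size_t tau = 0; tau < num_pulses; tau += pulses_per_chunk) {
        size_t n = num_pulses - tau;
        if (n > pulses_per_chunk)
            n = pulses_per_chunk;
        fpga_pio_write(0, (uint32_t)n);    /* register index is illustrative */
        fpga_dma_write(proj + tau * samples_per_pulse,
                       n * samples_per_pulse * sizeof proj[0]);
        fpga_wait_done();   /* FPGA accumulates this chunk into the subimage */
    }
    /* Retrieve the finished subimage for scaling and publication; the
     * magnitudes are assumed to arrive as one 32-bit word per pixel. */
    fpga_dma_read(image_out, PIXELS_PER_IMAGE * sizeof image_out[0]);
}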
After processing, the target subimages are simply held in the GPP memory. In the Swathbuckler system, subimages are distributed to consumers via a publish/subscribe mechanism, so there is no need to assemble all the subimages into a larger image.
4.3.1. Configuration Parameters
Our backprojection implementation can be configured using several compile-time parameters in both the host code and the VHDL code that describes the hardware. In software, the pixel spacings of the output image are set in a header file and compiled in. The sample spacing of the projection data is specific to a dataset, so it is read from the file that contains the projection data. It is also possible to set the dimensions of the subimage (1024 × 512 pixels by default), though the hardware would require significant changes to support a different size.
The hardware VHDL code allows two parameters to be set at compile time (see Section 4.4.3): the number of projection adders in the design (eight by default) and the depth of the projection memories (2048 words by default). Once compiled, the values of these parameters can be read from the FPGA by the host code.
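For concreteness, the corresponding compile-time constants might look like the following; the names are illustrative and do not reproduce the actual header or VHDL generics.

/* Host-side compile-time configuration (hypothetical names). */
#define SUBIMAGE_AZIMUTH_PIXELS 1024   /* x dimension of one node's subimage */
#define SUBIMAGE_RANGE_PIXELS    512   /* y dimension of one node's subimage */

/* Mirrors of the VHDL compile-time parameters (Section 4.4.3); the host
 * reads the values actually built into the bitstream back from the FPGA. */
#define NUM_PROJECTION_ADDERS      8   /* projections processed per step     */
#define PROJECTION_MEMORY_WORDS 2048   /* depth of each projection BlockRAM  */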
4.4. FPGA Design
The hardware that is instantiated on the FPGA boards runs the backprojection algorithm and computes the values of the pixels in the output image. A block diagram of the design is shown in Figure 3; the blocks in that figure are referred to by name in the text below.
4.4.1. Clock Domains
In general, using multiple clock domains in a design adds complexity and makes verification significantly more difficult. However, the design of the Annapolis Wildstar II board provides for one fixed-rate clock on the PCI interface, and a separate fixed-rate clock on the SRAM memories. This is a common attribute of FPGA-based systems.
To simplify the problem, we run the bulk of our design at the PCI clock rate (133 MHz). Since the Annapolis VHDL modules refer to the PCI interface as the "LAD bus", we call this the L-clock domain. Every block in Figure 3, with the exception of the SRAMs themselves and their associated address generators, is run from the L-clock.
The SRAMs are run from the memory clock, or M-clock, which is constrained to run at 50 MHz. Between the SRAMs (with their address generators) and the rest of the datapath, there is some interface logic and an FIFO. This is not shown in Figure 3, but it exists to cross the M-clock/L-clock domain boundary.
BlockRAM-based FIFOs, available as modules in the Xilinx CORE Generator [30] library, are used to cross clock domains. Since each of the ports on the dual-ported BlockRAMs is individually clocked, the read and write can happen in different clock domains. Control signals are automatically synchronized to the appropriate clock, that is, the "full" signal is synchronized to the write clock and the "empty" signal to the read clock. Using FIFOs whenever clock domains must be crossed provides a simple and effective solution.
4.4.2. Control Registers and DMA Input
The Annapolis API, like many FPGA control APIs, allows for communication between the host PC and the FPGA in two ways: "programmed" or memory-mapped I/O (PIO), which is best for reading and writing one or two words of data at a time, and direct memory access (DMA), which is best for transferring large blocks of data.
The host software uses PIO to set control registers on the FPGA. Projection data are placed in a specially allocated memory buffer and then transmitted to the FPGA via DMA. On the FPGA, the DMA input logic receives the data and arranges them in the BlockRAMs that hold the projection data.
4.4.3. Datapath
The main datapath of the backprojection hardware is shown in Figure 4. It consists of five parts: the SRAMs that hold the target image, the distance-to-time index calculator (DIC), the projection data BlockRAMs, the projection adders that perform the accumulation operation, and the address generators that drive all of the memories. These devices all operate in a synchronized fashion, though there are FIFOs in several places to temporally decouple the producers and consumers of data, as indicated in Figure 4 with segmented rectangles.
Address Generators
There are three data arrays that must be managed in this design: the input target data, the output target data, and the projection data. The pixel indices for the two target data arrays (the two SRAMs in Figure 4) are managed directly by separate address generators. The address generator for the projection data also produces pixel indices; the DIC converts each pixel index into a fast-time index that is used to address the BlockRAMs.
Because a single read/write operation to the SRAMs produces/consumes two pixel values, the address generators for the SRAMs run for half as many cycles as the address generator for the BlockRAMs. However, address generators run in the clock domain relevant to the memory that they are addressing, so the SRAM addresses take slightly longer to generate at 50 MHz than the BlockRAM addresses do at 133 MHz.
Because of the use of FIFOs between the memories and the adders, the address generators for the source SRAM and the projection BlockRAMs can run freely. FIFO control signals ensure that an address generator is paused in time to prevent it from overflowing the FIFO. The address generator for the destination SRAM is incremented whenever data are available from the output FIFO.
Distance-to-Time Index Calculator
The distance-to-time index calculator (DIC) implements (2), which comprises two parts. At first glance, each of these parts involves computation that requires a large amount of hardware and/or time to calculate. However, a few simplifying assumptions make this problem easier and reduce the amount of hardware needed.
Rather than implementing a tangent function in hardware, we rely on the fact that the beamwidth of the radar is a constant. The host code computes the tangent of the beamwidth angle and sends the result to the FPGA, where it is used to calculate the width of the beam at a given range. This value is used both on a coarse-grained level, to narrow the range of pixels that are examined for each processing step, and on a fine-grained level, to determine whether or not a particular pixel is affected by the current projection (see Figure 2).
The right-hand side of (2) is a distance function (a square root of a sum of squares) and a division. The square root is computed using an iterative shift-and-subtract algorithm, implemented in hardware as a pipeline of subtractors. Two multiplication units compute the squared terms. Some additional adders and subtractors are necessary to properly align the input data to the output data according to the data collection parameters discussed in Section 4.2.6. We used pipelined multipliers and division units from the Xilinx CORE Generator library; adders and subtractors are described with VHDL arithmetic operators, allowing the synthesis tools to generate the appropriate hardware.
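The shift-and-subtract square root is easy to express in software as well; the following C routine (an illustrative model, not the VHDL) computes floor(sqrt(v)) by deciding one result bit per iteration, which corresponds to one subtract/compare stage of the hardware pipeline.

#include <stdint.h>

/* Integer square root by shift-and-subtract: one result bit is decided per
 * iteration, which maps onto one subtract/compare stage of the pipeline. */
uint32_t isqrt64(uint64_t v)
{
    uint64_t rem = 0, root = 0;
    for (int i = 0; i < 32; i++) {
        rem = (rem << 2) | (v >> 62);    /* bring down the next two bits */
        v <<= 2;
        root <<= 1;
        uint64_t trial = (root << 1) | 1;
        if (trial <= rem) {              /* subtract if it fits, set result bit */
            rem -= trial;
            root |= 1;
        }
    }
    return (uint32_t)root;
}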
The distance function and the beamwidth comparison are computed in parallel. If the comparison determines that the pixel is outside the affected range, the adder input is forced to zero.
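To summarize the behavior of this block, a simplified software model of the DIC is sketched below. It reuses the isqrt64() routine above, works in the normalized units of Section 4.3 (range sample spacing of 1), and uses names and fixed-point choices of our own that do not appear in the actual VHDL.

#include <stdint.h>
#include <stdlib.h>

extern uint32_t isqrt64(uint64_t v);   /* from the previous sketch */

#define FRAC_BITS 16   /* fractional bits of the precomputed tangent */

/* Returns the fast-time index for pixel (x, y) and a pulse at azimuth
 * position x_radar, or -1 if the pixel lies outside the beam.  All
 * coordinates are in units of range samples; tan_half_beam_q is
 * tan(beamwidth/2) in fixed point, precomputed by the host. */
int64_t dic(int64_t x, int64_t y, int64_t x_radar,
            int64_t t_offset, uint32_t tan_half_beam_q)
{
    int64_t dx = x - x_radar;

    /* Fine-grained beam check: |dx| must not exceed y * tan(beamwidth/2). */
    int64_t half_width = (y * (int64_t)tan_half_beam_q) >> FRAC_BITS;
    if (llabs(dx) > half_width)
        return -1;                /* adder input will be forced to zero */

    /* Distance function, then alignment to the node's projection window. */
    uint64_t dist2 = (uint64_t)(dx * dx) + (uint64_t)(y * y);
    return (int64_t)isqrt64(dist2) - t_offset;
}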
Projection Data BlockRAMs
The output of the DIC is a fast-time index into the projection data array. Each projection data BlockRAM holds the data for a particular slow-time index. The fast-time index is applied to retrieve the single projection sample that corresponds to the pixel that was produced by the address generator. This value is stored in an FIFO, to be synchronized with the output of the source SRAM FIFO.

The projection data memories are configured to hold 2 k datawords by default, which should be sufficient for an image with 1 k pixels in the range dimension. This number is a compile-time parameter in the VHDL source and can be changed. The resource constraint is the number of available BlockRAMs.
Projection Adder
As the FIFOs from the projection data memories and the source SRAM are filled, the projection adder reads datawords from both FIFOs, adds them together, and passes the result to the next stage in the pipeline (see Figure 4).

The design is configured with eight adder stages, meaning that eight projections can be processed in one step. This number is a compile-time parameter in the VHDL source and can be changed. The resource constraint is a combination of the number of available BlockRAMs (because the projection data BlockRAMs and FIFOs are duplicated for each stage) and the amount of available logic (to implement the DIC).
The number of adder stages implemented directly impacts the performance of our application. By computing the contribution of multiple projections in parallel, we exploit the fine-grained parallelism inherent in the backprojection algorithm. Fine-grained parallelism is directly related to the performance gains achieved by implementing the application in hardware, where many small execution units can be implemented that all run at the same time on different pieces of data.
4.4.4. Complex Magnitude and DMA Output
When all projections have been processed, the final target image data reside in one of the target image SRAMs. The host code then requests that the image data be transferred via DMA to the host memory. This process occurs in three steps.
First, an address generator reads the data out of the SRAM in the correct order. Second, the data are converted from complex to real: the complex magnitude operator performs this function with a distance calculation of the form √(re² + im²), for which we instantiate another series of multipliers, adders, and subtractors (for the integer square root). Third, the real-valued pixels are passed to the DMA output logic, which sends them from the FPGA to the host memory.
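For reference, the complex magnitude stage and the final scaling performed by the host (Section 4.3) can be modeled in C as follows; the function names are ours, and the scaling rule shown is a simple linear mapping, whereas the actual host code may scale differently.

#include <stddef.h>
#include <stdint.h>

extern uint32_t isqrt64(uint64_t v);   /* shift-and-subtract square root */

/* Hardware side: reduce an 18-bit real / 18-bit imaginary pixel to a
 * single real magnitude, |z| = sqrt(re^2 + im^2). */
uint32_t complex_magnitude(int32_t re, int32_t im)
{
    uint64_t m2 = (uint64_t)((int64_t)re * re) + (uint64_t)((int64_t)im * im);
    return isqrt64(m2);
}

/* Software side: scale the magnitudes into 8-bit image pixels. */
void scale_to_image(const uint32_t *mag, uint8_t *image, size_t n, uint32_t max_val)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)mag[i] * 255 / (max_val ? max_val : 1);
        image[i] = (p > 255) ? 255 : (uint8_t)p;
    }
}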