 Research Article
 Open Access
Accelerating Seismic Computations Using Customized Number Representations on FPGAs
EURASIP Journal on Embedded Systems, volume 2009, Article number: 382983 (2008)
Abstract
The oil and gas industry has an increasingly large demand for high-performance computation over huge volumes of data. Compared to common processors, field-programmable gate arrays (FPGAs) can boost computation performance with a streaming computation architecture and support for application-specific number representation. With hardware support for reconfigurable number format and bit width, reduced precision can greatly decrease the area cost and I/O bandwidth of the design, thus multiplying the performance with concurrent processing cores on an FPGA. In this paper, we present a tool to determine the minimum number precision that still provides acceptable accuracy for seismic applications. By using the minimized number format, we implement core algorithms in seismic applications (the FK step in downward continued-based migration and 3D convolution in reverse time migration) on FPGAs and show speedups ranging from 5 to 7 times, including the transfer time to and from the processors. Provided sufficient bandwidth between CPU and FPGA, we show that a further increase to 48X speedup is possible.
1. Introduction
Seismic imaging applications in the oil and gas industry involve terabytes of data collected in the field. For each data sample, the imaging algorithm usually tries to improve image quality by performing more costly computations. Thus, there is an increasingly large demand for high-performance computation over huge volumes of data. Among the different kinds of imaging algorithms, downward continued-based migration [1] is the most prevalent high-end imaging technique today, and reverse time migration appears to be one of the dominant imaging techniques of the future.
Compared to conventional microprocessors, FPGAs use a streaming computation architecture: the computations we want to perform are mapped into circuit units on the FPGA. Previous work has already achieved 20X acceleration for prestack Kirchhoff time migration [2] and 40X acceleration for subsurface offset gathers [3].
Besides the capability of performing computations in parallel, FPGAs also support application-specific number representations. Since all the processing units and connections on the FPGA are reconfigurable, we can use different number representations, such as fixed-point, floating-point, logarithmic number system (LNS), residue number system (RNS), and so forth, with different bit-width settings. Different number representations lead to different complexities of the arithmetic units, and thus to different costs and performance of the resulting circuit design [4]. Switching to a number representation that fits a given application better can sometimes greatly improve performance or reduce cost.
A simple case of switching number representations is to trade off the precision of the number representation against the speed of the computation. For example, by reducing the precision from 32-bit floating-point to 16-bit fixed-point, the number of arithmetic units that fit into the same area can be increased severalfold, and the performance of the application improves significantly. Meanwhile, we must also watch for possible degradation of accuracy in the computation results, and check whether the accelerated computation using reduced precision still generates meaningful results.
To solve the above problem in the seismic application domain, we develop a tool that performs an automated precision exploration of different number formats and figures out the minimum precision that can still generate good enough seismic results. By using the minimized number format, we implement core algorithms in seismic applications (the complex exponential step in downward continued-based migration and 3D convolution in reverse time migration) on FPGAs and show speedups ranging from 5 to 7 times, including the transfer time to and from the processors. Provided sufficient bandwidth between CPU and FPGA, we show that a further increase to 48X speedup is possible.
2. Background
2.1. Number Representation
As mentioned in Section 1, precision and range are key resources to be traded off against the performance of a computation. In this work, we look at two different types of number representation: fixed-point and floating-point.
Fixed-Point Numbers
A fixed-point number has two parts, the integer part and the fractional part. Its format is shown in Table 1.
When it uses a sign-magnitude format (the first bit defines the sign of the number), its value is given by (-1)^s × m × 2^(-f), where m is the unsigned integer formed by the magnitude bits and f is the number of fractional bits. It may also use a two's-complement format to indicate the sign.
Floating-Point Numbers
According to the IEEE-754 standard, floating-point numbers can be divided into three parts: the sign bit, the exponent, and the mantissa, shown as in Table 2.
Their values are given by (-1)^s × 1.m × 2^(e - bias). The sign bit defines the sign of the number. The exponent part uses a biased format: its stored value equals the sum of the actual exponent and the bias, which is defined as 2^(w-1) - 1 for a w-bit exponent field. The extreme values of the exponent (all zeros and all ones) are used for special cases, such as zero, denormalized numbers, infinity, and NaN. The mantissa is an unsigned fractional number, with an implied "1" to the left of the radix point.
2.2. Hardware Compilation Tool
We use a stream compiler (ASC) [5] as our hardware compilation tool to develop a range of different solutions for seismic applications. ASC was developed following research at Stanford University and Bell Labs, and is now commercialized by Maxeler Technologies. ASC enables the use of FPGAs as highly parallel stream processors. ASC is a C-like programming environment for FPGAs. ASC code makes use of C++ syntax and ASC semantics, which allow the user to program on the architecture level, the arithmetic level, and the gate level. ASC provides the productivity of high-level hardware design tools and the performance of low-level optimized hardware design. On the arithmetic level, PAM-Blox II provides an interface for custom arithmetic optimization. On the higher level, ASC provides types and operators to enable research on custom data representation and arithmetic. The ASC hardware types are HWint, HWfix, and HWfloat. Using these data types, we build libraries such as a function evaluation library, or develop special circuits to solve particular computational problems such as graph algorithms. Algorithm 1 shows a simple example of an ASC description for a stream architecture that doubles the input and adds "55."
Algorithm 1: A simple ASC example.
// ASC code starts here
STREAM_START;
// Hardware Variable Declarations
HWint in (IN);
HWint out (OUT);
HWint tmp (TMP);
STREAM_LOOP (16);
tmp = (in * 2) + 55;
out = tmp;
// ASC code ends here
STREAM_END;
The ASC code segment shows HWint variables and the familiar C syntax for equations and assignments. Compiling this program with "gcc" and running it creates a netlist which can be transformed into a configuration bitstream for an FPGA.
2.3. Precision Analysis
There exist a number of research projects that focus on precision analysis, most of which are static methods that operate on the computational flow of the design and use techniques based on range and error propagation to perform the analysis.
Lee et al. [6] present a static precision analysis technique which uses affine arithmetic to derive an error model of the design and applies simulated annealing to find minimum bit widths that satisfy the given error requirement. A similar approach is taken in a bit-width optimization tool called Précis [7].
These techniques are able to perform an automated precision analysis of the design and provide optimized bit widths for the variables. However, they are not quite suitable for seismic imaging algorithms. The first reason is that seismic imaging algorithms usually involve numerous iterations, which can lead to overestimation of the error bounds and derive a meaningless error function. Secondly, the computation in the seismic algorithms does not have a clear error requirement. We can only judge the accuracy of the computation from the generated seismic image. Therefore, we choose to use a dynamic simulation method to explore different precisions, detailed in Sections 3.3 and 3.4.
2.4. Computation Bottlenecks in Seismic Applications
Downward-continued migration comes in various flavors, including common azimuth migration [8], shot profile migration, source-receiver migration, plane-wave or delayed shot migration, and narrow azimuth migration. Depending on the flavor of the downward continuation algorithm, there are four potential computation bottlenecks.
(i) In many cases, the dominant cost is the FFT step. The dimensionality of the FFT varies from 1D (tilted plane-wave migration [9]) to 4D (narrow azimuth migration [10]). The FFT cost is often dominant due to its O(n log n) cost, with n being the number of points in the transform, and the non-cache-friendly nature of multidimensional FFTs.
(ii) The FK step, which involves evaluating (or looking up) a square root function and performing a complex exponential, is a second potential bottleneck. The high operation count per sample can eat up significant cycles.
(iii) The FX step, which involves a complex exponential, or sine/cosine multiplication, has a similar, but computationally less demanding, profile. Subsurface offset gathers for shot profile or plane-wave migration, particularly 3D subsurface offset gathers, can be an overwhelming cost. The large operation count per sample and the non-cache-friendly nature of the data usage pattern can be problematic.
(iv) For finite difference-based schemes, a significant convolution cost can be involved.
The primary bottleneck of reverse time migration is applying the finite-difference stencil. In addition to the large operation count (5 to 31 samples per cell), the access pattern has poor cache behavior for real-size problems. Beyond applying the 3D stencil, the next most dominant cost is implementing damping boundary conditions. Methods such as perfectly matched layers (PMLs) can be costly [11]. Finally, if reverse time migration is used for velocity analysis, subsurface offset gathers need to be generated. The same cost profile that exists in downward continued-based migration exists for reverse time migration.
In this paper, we focus on two of the above computation bottlenecks: one is the FK step in downward continued-based migration, which includes a square root function and a complex exponential operation; the other is the 3D convolution in reverse time migration. We perform automated precision exploration of these two computation cores, so as to figure out the minimum precision that can still generate accurate enough seismic images.
3. A Tool for Number Representation Exploration
FPGA-based implementations have an advantage over current software-based implementations: they can use customizable number representations in their circuit designs. On a software platform, users are usually constrained to a few fixed number representations, such as 32/64-bit integers and single/double-precision floating-point, while the reconfigurable logic and connections of an FPGA enable users to explore various kinds of number formats with arbitrary bit widths. Furthermore, users are also able to design the arithmetic operations for these customized number representations, and can thus provide a highly customized solution for a given problem.
In general, to provide a customized number representation for an application, we need to determine the following three things.
(i) Format of the Number Representation. There are existing FPGA applications using fixed-point, floating-point, and logarithmic number system (LNS) [12] representations. Each of the three number representations has its own advantages and disadvantages over the others. For instance, fixed-point has simple arithmetic implementations, while floating-point and LNS provide a wide representation range. It is usually not possible to figure out the optimal format directly. Exploration is needed to guide the selection.
(ii) Bit Widths of Variables. This problem is generally referred to as bit-width or word-length optimization [6, 13]. We can further divide it into two parts: range analysis considers the problem of ensuring that a given variable inside a design has a sufficient number of bits to represent the range of the numbers, while in precision analysis, the objective is to find the minimum number of precision bits for the variables in the design such that the output precision requirements of the design are met.
(iii) Design of the Arithmetic Units. The arithmetic operations of each number system are quite different. For instance, in LNS, multiplication, division, and exponential operations become as simple as addition or shift operations, while addition and subtraction become nonlinear functions to approximate. The arithmetic operations of regular data formats, such as fixed-point and floating-point, also have different algorithms with different design characteristics. On the other hand, evaluation of elementary functions plays a large part in seismic applications (trigonometric and exponential functions). Different evaluation methods and configurations can be used to produce evaluation units with different accuracies and performance.
This section presents our tool, which tries to figure out the above three design options by exploring all the possible number representations. The tool is partly based on our previous work on bit-width optimization [6] and comparison between different number representations [14, 15].
Figure 1 shows our basic work flow for exploring different number representations for a seismic application. We manually partition the Fortran program into two parts: one part runs on CPUs, and we try to accelerate the other part (the target code) on FPGAs. The partition is based on two metrics: (1) the target code shall consume a large portion of the processing time of the entire program, otherwise the acceleration does not bring enough performance improvement to the entire application; (2) the target code shall be suitable for a streaming implementation on FPGAs, and thus highly likely to be accelerated. After partitioning, the first step is to profile the target code to acquire information about the range of values each variable can take and their distribution. In the second step, based on the range information, we map the Fortran code into a hardware design described in ASC format, which includes the implementation of arithmetic operations and function evaluation. In the third step, the ASC description is translated into bit-accurate simulation code and merged into the original Fortran program to provide a value simulator for the original application. Using this value simulator, explorations can be performed with configurable settings such as different number representations, different bit widths, and different arithmetic algorithms. Based on the exploration results, we can determine the optimal number format for this application with regard to metrics such as circuit area and performance.
3.1. Range Profiling
In the profiling stage, the major objective is to collect range and distribution information for the variables. The idea of our approach is to instrument every target variable in the code, adding function calls to initialize data structures for recording range information and to modify the recorded information when the variable value changes.
For the range information of the target variables (the variables to map into the circuit design), we keep a record of four specific points on the axis, shown in Figure 2. Two of the points record the values farthest from zero, that is, the maximum absolute values that need to be represented; based on them, the integer bit width of fixed-point numbers can be determined. The other two points record the values closest to zero, that is, the minimum absolute values that need to be represented. Using both the minimum and maximum values, the exponent bit width of floating-point numbers can be determined.
For the distribution information of each target variable, we keep a number of buckets that store the frequency of values in different intervals. Figure 3 shows the distribution information recorded for the real part of variable "wfld" (a complex variable). In each interval, the frequency of positive and negative values is recorded separately. The results show that, for the real part of variable "wfld," the frequencies of positive and negative values in each interval are quite similar, and the major distribution of the values falls within a limited range.
The distribution information provides a rough metric for the users to make an initial guess about which number representations to use. If the values of the variables cover a wide range, floatingpoint and LNS number formats are usually more suitable. Otherwise, fixedpoint numbers shall be enough to handle the range.
3.2. Circuit Design: Basic Arithmetic and Elementary Function Evaluation
After profiling range information for the variables in the target code, the second step is to map the code into a circuit design described in ASC. As a high-level FPGA programming language, ASC provides hardware data types, such as HWint, HWfix, and HWfloat. Users can specify the bit-width values for hardware variables, and ASC automatically generates corresponding arithmetic units for the specified bit widths. It also provides configurable options to specify different optimization modes, such as AREA, LATENCY, and THROUGHPUT. In the THROUGHPUT optimization mode, ASC automatically generates a fully pipelined circuit. These features make ASC an ideal hardware compilation tool for retargeting a piece of software code onto the FPGA hardware platform.
With support for fixedpoint and floatingpoint arithmetic operations, the target Fortran code can be transformed into ASC C++ code in a straightforward manner. We also have interfaces provided by ASC to modify the internal settings of these arithmetic units.
Besides basic arithmetic operations, the evaluation of elementary functions plays a large part in seismic applications. For instance, in the first piece of target code we try to accelerate, the FK step, a large portion of the computation is to evaluate the square root and sine/cosine functions. To map these functions into efficient units on the FPGA, we use a table-based uniform polynomial approximation approach, based on Dong-U Lee's work on optimizing hardware function evaluation [16]. The evaluation of the two functions can be divided into three different phases [17].
(i) Range reduction: reduce the range of the input variable into a small interval that is convenient for the evaluation procedure. The reduction can be multiplicative (e.g., for the square root function) or additive (e.g., for the sine/cosine functions).
(ii) Function evaluation: approximate the value of the function using a polynomial within the small interval.
(iii) Range reconstruction: map the value of the function in the small interval back into the full range of the input variable.
To keep the whole unit small and efficient, we use degree-one polynomials, so that only one multiplication and one addition are needed to produce the evaluation result. Meanwhile, to keep the approximation error small, the reduced evaluation range is divided into uniform segments. Each segment is approximated with a degree-one polynomial, using the minimax algorithm. In the FK step, the square root function is approximated with 384 segments in the range [0.25, 1], while the sine and cosine functions are approximated with 512 segments in the range [0, 2π].
3.3. Bit-Accurate Value Simulator
As discussed in Section 3.1, based on the range information, we are able to determine the integer bit width of fixed-point numbers, and partly determine the exponent bit width of floating-point numbers (as the exponent bit width relates not only to the range but also to the accuracy). The remaining bit widths, such as the fractional bit width of fixed-point numbers and the mantissa bit width of floating-point numbers, are predominantly related to the precision of the calculation. In order to find the minimum acceptable values for these precision bit widths, we need a mechanism to determine whether a given set of bit-width values produces satisfactory results for the application.
In our previous work on function evaluation and other arithmetic designs, we set a requirement on the absolute error of the whole calculation, and use a conservative error model to determine whether the current bit-width values meet the requirement [6]. However, a specified requirement on absolute error does not work for seismic processing. To find out whether the current configuration of precision bit widths is accurate enough, we need to run the whole program to produce the seismic image, and determine whether the image contains the correct pattern information. Thus, to enable exploration of different bit-width values, a value simulator for different number representations is needed to provide bit-accurate simulation results for the hardware designs.
While producing results that are bit-accurate with respect to the corresponding hardware design, the simulator also needs to be efficiently implemented, as we need to run the whole application (which takes days on the full input dataset) to produce the image.
In our approach, the simulator works with ASC-format C++ code. It reimplements the ASC hardware data types, such as HWfix and HWfloat, and overloads their arithmetic operators with the corresponding simulation code. For HWfix variables, the value is stored in a 64-bit signed integer, while another integer records the position of the fractional point. The basic arithmetic operations are mapped into shifts and arithmetic operations on the 64-bit integers. For HWfloat variables, the value is stored in an 80-bit extended-precision floating-point number, with two other integers used to record the exponent and mantissa bit widths. To keep the simulation simple and fast, the arithmetic operations are processed using floating-point values. However, to keep the results bit-accurate, during each assignment we decompose the floating-point value into mantissa and exponent with bit operations, truncate according to the exponent and mantissa bit widths, and combine them back into a floating-point value.
3.4. Accuracy Evaluation of Generated Seismic Images
As mentioned above, the accuracy of a generated seismic image depends on the pattern contained inside it, which estimates the geophysical status of the investigated area. To judge whether the image is accurate enough, we compare it to a "target" image, which is processed using single-precision floating-point and assumed to contain the correct pattern.
To perform this pattern comparison automatically, we use techniques based on prediction-error filters (PEFs) [18] to highlight differences between two images. The basic work flow for comparing a generated image to the "target" image is as follows.
(i) Divide the target image into overlapping small regions of pixels, and estimate PEFs for these small regions.
(ii) Apply these PEFs to both the target image and the generated image to obtain two filtered results.
(iii) Combine the two filtered results algebraically to acquire a single value indicating the difference between the images.
At the end of the above work flow, we obtain a single value which describes the difference between the generated image and the "target" image. For convenience of the discussion that follows, we call this value the "difference indicator" (DI).
Figure 4 shows a set of different seismic images calculated from the same dataset, and their DI values compared to the image with the correct pattern. The image showing the correct pattern is calculated using single-precision floating-point, while the other images are calculated using fixed-point designs with different bit-width settings. All these images are results of the bit-accurate value simulator mentioned above.
If the generated image contains no information at all (as shown in Figure 4(a)), the comparison does not return a finite value. This mostly happens when a very low precision is used for the calculation: the information is lost during the numerous iterations, and the result contains only zeros or infinities. If the comparison result is very large (Figures 4(b) and 4(c)), the image contains a random pattern far different from the correct one. With a smaller comparison result (Figure 4(d)), the image contains a pattern similar to the correct one, but information in some parts is lost. With a comparison result below a certain threshold, the generated image contains almost the same pattern as the correct one.
Note that the DI value is calculated from algebraic operations on the two images being compared. The magnitude of the DI value is only a relative indication of the difference between the two images. The actual usage of the DI value is to locate the boundary between images that contain mostly noise and images that provide useful patterns of the earth model. From the samples shown in Figure 7, a threshold DI value can be identified in this specific case as good guidance for acceptable accuracy of the design. From the bit-width exploration results shown in Section 4, we can see that this DI threshold also marks the precision at which the image turns from noise into an accurate pattern as the bit width increases.
3.5. Number Representation Exploration
Based on all the above modules, we can now perform exploration of different number representations for the FPGA implementation of a specific piece of Fortran code.
The current tool supports two different number representations, fixed-point and floating-point numbers (the value simulator for LNS is still in progress). For both number formats, users can specify arbitrary bit widths for each variable.
There are usually a large number of different variables involved in one circuit design. In our previous work, we usually apply heuristic algorithms, such as ASA [19], to find a close-to-optimal set of bit-width values for the different variables. The heuristic algorithms may require millions of test runs to check whether a specific set of values meets the constraints. This is acceptable when the test run is only a simple error function that can be processed in nanoseconds. In our seismic processing application, depending on the problem size, it takes from half an hour to several days to run one test set and obtain the resulting image. Thus, heuristic algorithms become impractical.
A simple and straightforward method to solve the problem is to use a uniform bit width over all the different variables, and either iterate over a set of possible values or use a binary search to find an appropriate bit-width value. Based on the range information and the internal behavior of the program, we can also divide the variables in the target Fortran code into several groups, and assign a different uniform bit width to each group. For instance, in the FK step, there is a clear boundary: the first half performs square, square root, and division operations to calculate an integer value, and the second half uses the integer value as a table index and performs sine, cosine, and complex multiplications to get the final result. Thus, in the hardware circuit design, we divide the variables into two groups based on which half they belong to. Furthermore, in the second half of the function, some of the variables are trigonometric values in the range [-1, 1], while the other variables represent the seismic image data and span much larger magnitudes. Thus, they can be further divided into two parts and assigned bit widths separately.
4. Case Study I: The FK Step in Downward Continued-Based Migration
4.1. Brief Introduction
The code shown in Algorithm 2 is the computationally intensive portion of the FK step in downward continued-based migration. The governing equation for the FK step is the double square root equation (DSR) [20]. The DSR equation describes how to downward continue a wavefield one depth step. The equation is valid for a constant-velocity medium and is based on the wave numbers of the source and receiver. The DSR equation can be written as (1), where ω is the frequency. The code takes the approach of building a priori a relatively small table of the possible values of the square root. The code then performs a table lookup that converts a given value to an approximate value of the square root.
Algorithm 2: The code for the major computations of the FK step.
! generation of table step%ctable
do i = 1, size (step%ctable)
   * step%dstep * dsr%phase (i)
   step%ctable (i) = dsr%amp (i) * cmplx (cos ( ), sin ( ))
end do
! the core part of function wei_wem
do i4 = 1, size (wfld, 4)
  do i3 = 1, size (wfld, 3)
    do i2 = 1, size (wfld, 2)
      do i1 = 1, size (wfld, 1)
        k = sqrt (step% (i1, i3)**2 + step% (i2, i4)**2)
        itable = max (1, min (int (1 + k / ko / dsr%d), dsr%n))
        wfld (i1, i2, i3, i4, i5) = wfld (i1, i2, i3, i4, i5) * step%ctable (itable)
      end do
    end do
  end do
end do
In practical applications, "wfld" contains millions of data items. The computation pattern of this function makes it an ideal target to map to a streaming hardware circuit on an FPGA.
4.2. Circuit Design
The mapping from the software code to a hardware circuit design is straightforward for most parts. Figure 5 shows the general structure of the circuit design. Compared with the software Fortran code shown in Algorithm 2, one big difference is the handling of the sine and cosine functions. In the software code, the trigonometric functions are calculated outside the five-level loop and stored as a lookup table. In the hardware design, to take advantage of the parallel calculation capability provided by the numerous logic units on the FPGA, the calculation of the sine/cosine functions is merged into the processing core of the inner loop. Three function evaluation units are included in this design to produce values for the square root, cosine, and sine functions separately. As mentioned in Section 3.2, all three functions are evaluated using degree-one polynomial approximation with 384 or 512 uniform segments.
The other task in the hardware circuit design is to map the calculation into arithmetic operations of certain number representations. Table 3 shows the value ranges of some typical variables in the FK step. Some of the variables (in the square root and sine/cosine function evaluations) have a small range within [0, 1], while other values (especially the "wfld" data) span a wide dynamic range. If we use floating-point or LNS number representations, their wide representation ranges are enough to handle these variables. However, if we use fixed-point number representations in the design, special handling is needed to achieve acceptable accuracy over the wide ranges.
The first issue to consider in fixed-point designs is the enlarged error caused by the division that follows the evaluation of the square root. The operands of the division come from the software program as input values to the hardware circuit, and contain errors propagated from previous calculations or introduced by truncation/rounding into the specified bit width on hardware. Suppose the error in the square root result k is e_k and the error in the divisor ko is e_o; assuming that the division unit itself does not introduce extra error, the error in the division result k/ko is approximately e_k/ko - k·e_o/ko². According to the profiling results, ko has a wide dynamic range and can be small while k can be comparatively large, so in the worst case the error from k can be magnified by 70 times, and the error from ko by approximately 9000 times.
To solve the problem of enlarged errors, we perform shifts at the input side to keep the three input values in similar ranges. One variable is shifted by a first distance so that it falls into a normalized range, and the other two are shifted by a second distance so that the larger of the two also falls into the same range. The difference between the two shift distances is recorded so that, after the division, the result can be shifted back into the correct scale. After shifting, the division only magnifies the errors by a factor of 3 to 6. Meanwhile, as the three variables are originally represented in single-precision floating-point in software, passing their values after shifting preserves a large part of the information stored in the mantissa. Thus, the shifting mechanism achieves better accuracy for fixed-point designs.
Figure 6 compares the accuracy of the table index calculation with and without shifting, for different uniform bit widths. The possible range of the table index result is from 1 to 2001. As it is an index into tables of smooth sequential values, an error within five indices is generally acceptable. We use the table index results calculated with single-precision floating-point as the true values for error calculation. When the uniform bit width of the design changes from 10 to 20, designs using the shifting mechanism show a stable maximum error of 3 and an average error around 0.11. In contrast, the maximum error of designs without shifting varies from 2000 down to 75, and the average error varies from approximately 148 down to 0.5. These results show that the shifting mechanism provides much better accuracy for the table index calculation in fixed-point designs.
The other issue to consider is the representation of the "wfld" data variables. As shown in Table 3, both the real and imaginary parts of the "wfld" data span a very wide range. Generally, fixed-point numbers are not suitable for such wide ranges. However, in this seismic application, the "wfld" data stores the processed image information, and it is more important to preserve the pattern in the data values than the values themselves. Thus, by omitting the small values and using the limited bit width to store the information contained in the large values, fixed-point representations can still achieve an accurate image in the final step. In our design, for convenience of bit-width exploration, we scale down all the "wfld" data values by a constant ratio so that they fall into the range [0, 1).
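A power-of-two variant of this scaling (our assumption for illustration; the paper only states a constant ratio) can be sketched as:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Pick one global power-of-two scale 2^e such that every |sample| / 2^e < 1,
// mapping the "wfld" values into [0, 1) while preserving the relative
// magnitudes that carry the image pattern.
double wfld_scale(const std::vector<double>& samples) {
    double peak = 0.0;
    for (double x : samples) peak = std::max(peak, std::fabs(x));
    int e = 0;
    std::frexp(peak, &e);       // peak = f * 2^e with f in [0.5, 1)
    return std::ldexp(1.0, e);  // 2^e > peak, so every |x| / 2^e < 1
}
```

A power-of-two ratio has the advantage that scaling reduces to an exponent shift and introduces no rounding error of its own.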
4.3. Bit-Width Exploration Results
The original software Fortran code of the FK step performs the whole computation in single-precision floating-point. We first replace the Fortran code of the FK step with C++ code using double-precision floating-point to generate a full-precision image to compare against. After that, to investigate how different number representations for the variables in the FK step affect the accuracy of the whole application, we replace the code of the FK step with our simulation code, which can be configured with different number representations and bit widths, and generate results for the different settings. The accuracy-evaluation approach introduced in Section 3.4 provides DI values that indicate how much the patterns of the resulting seismic images differ from the pattern of the full-precision image.
4.3.1. Fixed-Point Designs
In the first step, we apply a uniform bit width over all the variables in the design, varying it from 10 to 20. With a uniform bit width of 16, the design provides a DI value around 100, which means that the image contains a pattern almost identical to the correct one.
In the second step, as mentioned in Section 3.5, we divide the variables in the design into groups according to their range and operational behavior, and apply a uniform bit width within each group. In the hardware design for the FK step, the variables are divided into three groups: SQRT, the part from the beginning to the table index calculation, which includes the evaluation of the square root; SINE, the part from the end of SQRT to the evaluation of the sine and cosine functions; and WFLD, the part that multiplies the complex "wfld" data values by a complex value consisting of the sine and cosine values (for phase modification) and a real value (for amplitude modification). To perform the accuracy investigation, we keep two of the three bit widths constant and change the other one gradually to see its effect on the accuracy of the entire application.
Figure 7(a) shows the DI values of the generated images when we change the bit width of the SQRT part from 6 to 20, with the bit widths of the SINE and WFLD parts set to 20 and 30, respectively. Large bit widths are used for the other two parts so that they contribute little to the errors and the effect of the SQRT bit width can be isolated. The SQRT case shows a clear precision threshold at a bit width of 10: when the SQRT bit width increases from 8 bits to 10 bits, the DI value falls by several orders of magnitude. The significant improvement in accuracy is also demonstrated in the generated seismic images. The image on the left of Figure 7(a) is generated with the 8-bit design. Compared to the "true" image calculated with single-precision floating-point, the upper part of the image is mainly noise, while the lower part starts to show a pattern similar to the correct one. The difference in quality between the lower and upper parts is due to the imaging algorithm, which calculates the image as a summation over a number of points at the corresponding depth. In acoustic models, there are generally more sample points as we go deeper into the earth; therefore, at the same precision, the lower part shows better quality than the upper part. The image on the right of Figure 7(a) is generated with the 10-bit design, and already contains almost the same pattern as the "true" image.
In a similar way, we perform the exploration for the other two parts, and acquire precision thresholds of 10, 12, and 16 bits for the SQRT, SINE, and WFLD parts, respectively. However, as the above results are acquired with two of the three bit widths set to very large values, a practical solution must be slightly larger than these thresholds. Meanwhile, constrained by the current I/O bandwidth of 64 bits per cycle, the sum of the bit widths for the SQRT and WFLD parts must be less than 30. We perform further experiments for bit-width values around this initial guess, and find that bit widths of 12, 16, and 16 for the three parts provide a DI value of 131.5 while meeting the bandwidth requirement.
4.3.2. Floating-Point Designs
In the floating-point design of the FK step, we explore different exponent and mantissa bit widths. As in the fixed-point designs, we use a uniform bit width for all the variables; when we investigate one of the two widths, we keep the other at a constant high value.
Figure 7(b) shows the case where we change the exponent bit width from 3 to 10 while keeping the mantissa bit width at 24. There is again a clear threshold, at an exponent bit width of 6: below that width, the DI value of the generated image remains very large; when the exponent bit width increases to 6, the DI value decreases to around 1.
With a similar exploration of the mantissa bit width, we find that an exponent bit width of 6 and a mantissa bit width of 16 are the minimum bit widths needed to achieve an acceptably low DI value. Experiments confirm that this combination produces an image with a DI value of 43.96.
4.4. Hardware Acceleration Results
The hardware acceleration platform used in this project is the FPGA computing platform MAX1, provided by Maxeler Technologies [21]. It contains a high-performance Xilinx Virtex IV FX100 FPGA, which consists of 42176 slices, 376 BRAMs, and 192 embedded multipliers. It also provides a high-bandwidth PCI Express X8 interface (2 GB per second) to the software side residing on the CPUs.
Based on the exploration results for different number representations, the fixed-point design with bit widths of 12, 16, and 16 for the three parts is selected for our hardware implementation. The design produces images containing the same pattern as the double-precision floating-point implementation, and has the smallest bit-width values, that is, the lowest resource cost, among all the different number representations.
Table 4 shows the speedups achieved on the FPGA compared to software solutions running on an Intel Xeon CPU at 1.86 GHz. We experiment with two datasets of different sizes. For each dataset, we record the processing time over 10 000 runs and report the average. Speedups of 6.3 and 6.9 times are achieved for the two datasets, respectively.
Table 5 shows the resource cost of implementing the FK step on the FPGA card: 28% of the logic units, 15% of the BRAMs (memory units), and 10% of the arithmetic units. Considering that a large part (around 20%) of the used logic units are circuits handling the PCI Express I/O, there is still much potential to put more processing cores onto the FPGA card and gain even higher speedups.
5. Case Study II: 3D Convolution in Reverse Time Migration
3D convolution is one of the major computation bottlenecks in reverse time migration algorithms. In this paper, we implement a 6th-order acoustic modeling kernel to investigate the potential speedups on FPGAs. The 3D convolution uses a kernel with 19 elements. Once each line of the kernel has been processed, it is scaled by a constant factor.
One of the key challenges in implementing 3D convolution is maintaining fast access to all the data elements needed for a 19-point operation. As the data items are stored linearly in one dimension, accessing them in a 3D pattern requires either buffering a large amount of data or accessing memory in a very slow nonlinear pattern. In our FPGA design, we solve this problem by buffering the block currently being processed in BRAM FIFOs. ASC provides a convenient interface that automatically buffers the input values into BRAMs and lets users access them by specifying the cycle number at which each value was read in. Thus, we can easily index into the stream to obtain values already sent to the FPGA and perform the 3D operator.
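The buffering idea can be modeled in software as a circular delay line; this is our sketch of the concept, not the actual ASC interface:

```cpp
#include <cstddef>
#include <vector>

// Software model of indexing into a stream by "cycles ago": a circular
// buffer holds the last `depth` samples, mimicking the BRAM FIFOs that let
// the convolution kernel read values sent to the FPGA in earlier cycles.
class StreamBuffer {
    std::vector<float> buf;
    std::size_t head = 0;
public:
    explicit StreamBuffer(std::size_t depth) : buf(depth, 0.0f) {}
    // Accept the sample arriving in the current cycle.
    void push(float v) {
        buf[head] = v;
        head = (head + 1) % buf.size();
    }
    // Value received `ago` cycles before the most recent push (0 = newest).
    float at(std::size_t ago) const {
        return buf[(head + buf.size() - 1 - ago) % buf.size()];
    }
};
```

A 19-point stencil then reads its neighbors as fixed cycle offsets into such a buffer, so all operands are available without re-fetching from external memory.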
Compared to 3D convolution on CPUs, the FPGA has two major advantages. One is the capability of performing computations in parallel: we exploit the parallelism of the FPGA to calculate one result per cycle. When ASC assigns the elements to BRAMs, it does so in a way that maximizes the number of elements that can be obtained from the BRAMs every cycle; this means that consecutive elements of the kernel must, in general, not be placed in the same BRAM. The other advantage is the support for application-specific number representations. By using a 20-bit fixed-point format (the minimum bit-width setting that provides acceptable accuracy), we greatly reduce the area cost and can thus fit more processing units onto the FPGA.
We test the convolution design on a large 3D dataset. Processing the entire volume at once (as is the case when a high-performance processor is used) requires a large local memory (in the case of the processor, a large cache), whereas the FPGA has limited on-chip resources (376 BRAMs, each of which can hold 512 32-bit values). To solve this problem, we break the large dataset into cubes and process them separately. To utilize all of our input and output bandwidth, we assign 3 processing cores to the FPGA, resulting in 3 inputs and 3 outputs per cycle at 125 MHz (constrained by the throughput of the PCI Express bus). This gives us a theoretical maximum throughput of 375 M results per second.
The disadvantage of breaking the problem into smaller blocks is that the boundaries of each block are essentially wasted (although a minimal amount of reuse can occur), because they must be resent when the adjacent block is calculated. We do not consider this a problem, since the blocks we use are large enough that only a small proportion of the data is resent.
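A rough estimate of this overhead can be sketched as follows; the halo width of 3 cells per face matches the 6th-order kernel, and the formula itself is our back-of-envelope model, not a figure from the paper:

```cpp
// Fraction of samples that must be resent when a volume is split into
// B x B x B blocks with a halo of h cells on each face: each block reads
// B^3 cells but produces only (B - 2h)^3 interior results.
double resend_fraction(int B, int h) {
    double total  = static_cast<double>(B) * B * B;
    double inner  = static_cast<double>(B - 2 * h) * (B - 2 * h) * (B - 2 * h);
    return 1.0 - inner / total;
}
```

For blocks of around 100 cells per side with a 3-cell halo, under a fifth of the data is resent, whereas very small blocks would waste most of the bandwidth on halos.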
In software, the convolution executes in 11.2 seconds on average. The experiment was carried out on a dual-processor machine (two quad-core Intel Xeon CPUs at 1.86 GHz) with 8 GB of memory.
In hardware, using the MAX1 platform, we perform the same computation in 2.2 seconds and obtain a 5 times speedup. The design uses 48 DSP blocks (30%), 369 RAMB16 blocks (98%), and 30,571 slices (72%) of the Virtex IV chip. This means that there is room on the chip to substantially increase the kernel size. For a larger kernel (31 points), the speedup should scale virtually linearly, resulting in an 8 times speedup compared to the CPU implementation.
6. Further Potential Speedups
One of the major constraints on achieving higher speedups on FPGAs is the limited bandwidth between the FPGA card and the CPU. Through the current PCI Express interface provided by the MAX1 platform, in each cycle we can only read 8 bytes into the FPGA card and write back 8 bytes to the system.
An example is the implementation of the FK step, described in Section 4. As shown in Algorithm 2, our current design takes the scalar operands and both the real and imaginary parts of the "wfld" data as inputs to the circuit on the FPGA, and takes the modified real and imaginary parts of the "wfld" data as outputs. Therefore, although there is ample space on the FPGA card to support multiple cores, the interface bandwidth can only support a single core, giving a speedup of around 7 times.
However, in the specific case of the FK step, further techniques can be used to gain additional speedup. From the code in Algorithm 2, we can see that one of the inputs varies with all four loop indices, while the other two vary with only two of the four indices. To take advantage of this characteristic, we can divide the processing of the loop into two parts: in the first part, we use the bandwidth to read in the slowly varying values without doing any calculation; in the second part, we devote the bandwidth to reading in the remaining data only, and start the processing as well. In this pattern, for a four-level loop, the bandwidth can support two cores processing concurrently while spending only 1 out of every 100 cycles reading in the slowly varying values in advance, nearly doubling the speedup. Furthermore, assuming unlimited communication bandwidth, the cost of BRAMs (15%) becomes the major constraint. We could then put 6 concurrent cores on the FPGA card and achieve a speedup of around 48 times.
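The two-phase schedule above can be captured by a back-of-envelope throughput model (our sketch, with `cores` and `preload` as illustrative parameters):

```cpp
// Model of the two-phase schedule: a fraction `preload` of all cycles is
// spent reading the slowly varying inputs without computing; the remaining
// cycles feed `cores` concurrent pipelines, each producing one result per
// cycle. Returns the effective number of results per cycle.
double effective_throughput(int cores, double preload) {
    return cores * (1.0 - preload);
}
```

With two cores and a 1% preload phase the effective throughput is 1.98 results per cycle, i.e. the single-core speedup is nearly doubled at a negligible scheduling cost.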
Another possibility is to move as much computation as possible onto the FPGA card, reducing the communication cost between the FPGA and the CPU. If multiple portions of the algorithm are performed on the FPGA without returning to the CPU, the additional speedup can be considerable. For instance, as mentioned in Section 2, the major computation cost in downward continued-based migration lies in the multidimensional FFTs and the FK step. If the FFT and the FK step reside simultaneously on the FPGA card, the communication cost between them can be eliminated completely. In the case of 3D convolution in reverse time migration, multiple time steps can be applied simultaneously.
7. Conclusions
This paper describes our work on accelerating seismic applications by using customized number representations on FPGAs. The focus is on improving the performance of the FK step in downward continued-based migration and the acoustic 3D convolution kernel in reverse time migration. To investigate the tradeoff between precision and speed, we develop a tool that performs an automated precision exploration of different number formats and determines the minimum precision that still generates sufficiently accurate seismic results. Using the minimized number formats, we implement the FK step in downward continued-based migration and the 3D convolution in reverse time migration on an FPGA and show speedups ranging from 5 to 7, including the transfer time to and from the processors. We also show that there is further potential to accelerate these applications by over 10 or even 48 times.
References
1. Gazdag J, Sguazzero P: Migration of seismic data by phase shift plus interpolation. In Migration of Seismic Data. Edited by: Gardner GHF. Society of Exploration Geophysicists, Tulsa, Okla, USA; 1985.
2. He C, Lu M, Sun C: Accelerating seismic migration using FPGA-based coprocessor platform. Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '04), April 2004, Napa, Calif, USA, 207-216.
3. Pell O, Clapp RG: Accelerating subsurface offset gathers for 3D seismic applications using FPGAs. SEG Technical Program Expanded Abstracts 2007, 26(1): 2383-2387.
4. Deschamps J, Bioul G, Sutter G: Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems. Wiley-Interscience, New York, NY, USA; 2006.
5. Mencer O: ASC: a stream compiler for computing with FPGAs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2006, 25(9): 1603-1617.
6. Lee DU, Gaffar AA, Cheung RCC, Mencer O, Luk W, Constantinides GA: Accuracy-guaranteed bit-width optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2006, 25(10): 1990-2000.
7. Chang ML, Hauck S: Precis: a user-centric word-length optimization tool. IEEE Design & Test of Computers 2005, 22(4): 349-361. doi:10.1109/MDT.2005.92
8. Biondi B, Palacharla G: 3D prestack migration of common-azimuth data. Geophysics 1996, 61(6): 1822-1832. doi:10.1190/1.1444098
9. Shan G, Biondi B: Imaging steep salt flank with plane-wave migration in tilted coordinates. SEG Technical Program Expanded Abstracts 2006, 25(1): 2372-2376.
10. Biondi B: Narrow-azimuth migration of marine streamer data. SEG Technical Program Expanded Abstracts 2003, 22(1): 897-900.
11. Zhao L, Cangellaris AC: GT-PML: generalized theory of perfectly matched layers and its application to the reflectionless truncation of finite-difference time-domain grids. IEEE Transactions on Microwave Theory and Techniques 1996, 44(12, part 2): 2555-2563. doi:10.1109/22.554601
12. Matousek R, Tichy M, Pohl Z, Kadlec J, Softley C, Coleman N: Logarithmic number system and floating-point arithmetic on FPGA. Proceedings of the 12th International Conference on Field-Programmable Logic and Applications (FPL '02), August 2002, Madrid, Spain, 627-636.
13. Constantinides GA, Cheung PYK, Luk W: Heuristic datapath allocation for multiple word-length systems. Proceedings of the Conference on Design, Automation and Test in Europe (DATE '01), March 2001, Munich, Germany, 791-796.
14. Fu H, Mencer O, Luk W: Comparing floating-point and logarithmic number representations for reconfigurable acceleration. Proceedings of the IEEE International Conference on Field Programmable Technology (FPT '06), December 2006, Bangkok, Thailand, 337-340.
15. Fu H, Mencer O, Luk W: Optimizing logarithmic arithmetic on FPGAs. Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '07), April 2007, Napa, Calif, USA, 163-172.
16. Lee DU, Gaffar AA, Mencer O, Luk W: Optimizing hardware function evaluation. IEEE Transactions on Computers 2005, 54(12): 1520-1531. doi:10.1109/TC.2005.201
17. Muller J: Elementary Functions: Algorithms and Implementation. Birkhäuser, Secaucus, NJ, USA; 1997.
18. Claerbout J: Geophysical estimation by example: environmental soundings image enhancement. Stanford Exploration Project, 1999, http://sepwww.stanford.edu/sep/prof/
19. Ingber L: Adaptive Simulated Annealing (ASA) 25.15. 2004, http://www.ingber.com/
20. Claerbout J: Basic Earth Imaging (BEI). 2000, http://sepwww.stanford.edu/sep/prof/
21. Maxeler Technologies, http://www.maxeler.com
Acknowledgments
The support from the Center for Computational Earth and Environmental Science, the Stanford Exploration Project, the Computer Architecture Research Group at Imperial College London, and Maxeler Technologies is gratefully acknowledged. The authors would also like to thank Professor Martin Morf and Professor Michael Flynn for their support and advice.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Keywords
 Arithmetic Operation
 Seismic Image
 Residue Number System
 Number Format
 Arithmetic Unit