### 3.1 Heterogeneous embedded computing platform

Although the MV adaptive beamforming algorithm produces high-quality medical ultrasonic images, it is computationally demanding. Its high computational complexity hinders real-time implementation on conventional embedded computing platforms, such as conventional ARM processors. Therefore, this paper explores the implementation of the MV adaptive beamforming algorithm on a heterogeneous embedded computing platform with a high-performance GPU, so as to validate the real-time imaging capability of the algorithm on embedded platforms.

The symbolic architecture modules of the heterogeneous embedded computing platform are illustrated in Fig. 3. As shown in Fig. 3, the ARM processors and the embedded GPU reside within a single embedded processing chip, together with the internal memory and the external memory modules. The platform also provides plenty of peripherals, such as a camera input module, a display output module, USB, GPIO, and other common peripheral connector modules. The heterogeneous embedded computing platforms investigated in the MV adaptive beamforming implementation experiments and evaluations, which are products of Nvidia Corporation, are described in Section 4.

### 3.2 Implementation strategies for embedded GPU

Before starting the implementation design of the MV adaptive beamforming algorithm on the embedded GPU computing platform, the sequential MV beamforming code was profiled to determine which parts of the algorithm consumed the most execution time. The profiling result is shown in Fig. 4. As seen from Fig. 4, the input data delay calculation and the covariance matrix calculation consumed most of the execution time of the sequential MV code. As *L*/*M* increased, the percentage of execution time used for covariance matrix calculation and adaptive weight calculation increased, while the percentage used for input data delay calculation decreased. These changes conform to the computational complexities of the different calculation steps: the input data delay calculation has a computational complexity of *O*(*n*), the covariance matrix calculation *O*(*n*^{2}), and the adaptive weight calculation *O*(*n*^{3}). Based on this analysis of the sequential MV code, the code parts with high computational complexity or a high percentage of execution time should be implemented on the GPU platform. Consequently, also to reduce the hardware/software communications between the GPU and the ARM processor, all parts of the MV adaptive beamforming algorithm were implemented on the GPU platform.

In order to utilize the high-performance embedded GPU of the heterogeneous embedded computing platform efficiently when implementing the MV adaptive beamforming algorithm, the following implementation strategies are applied.

#### 3.2.1 Allocation of GPU computing resources

The first implementation strategy is about the allocation of the GPU computing resources. According to the GPU compute unified device architecture (CUDA) programming principles, the GPU CUDA programming model consists of three programming hierarchy levels, i.e., GPU compute grid, GPU compute block, and GPU compute thread. At the top level of the programming hierarchy, all the algorithm computations are executed within one GPU compute grid. Meanwhile, at the second level of the programming hierarchy, the program tasks are allocated into a set of GPU compute blocks. The computation in different GPU compute blocks can be executed in parallel. Besides, at the third level of the programming hierarchy, the computational workloads are assigned to a series of GPU compute threads. The programs in different GPU compute threads are executed simultaneously, while the program instructions within one GPU compute thread are executed sequentially.

Such a hierarchical GPU CUDA programming model can be applied to the medical ultrasonic image formation process that utilizes the MV adaptive beamforming algorithm. In this process, the image pixel amplitude estimates of the whole image are calculated following the MV adaptive beamforming calculation steps described in Section 2. The image pixels are organized in rows and columns, which can be mapped to the GPU compute grid with a two-dimensional GPU compute block allocation, as shown in Fig. 5. As a result, each GPU compute block is responsible for calculating one image pixel amplitude estimate. The program steps used to calculate the image pixel amplitude estimate are executed inside one block via the simultaneous computation of the parallel threads within that block. Finally, the sequential operations of the program, which cannot be executed at the same time, are computed inside the threads. The best choices of GPU block size and GPU thread size depend on matching the computational resources of the embedded platform to the computational problem size.
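As a rough CPU-side sketch of this mapping, the helper below computes a CUDA-style launch configuration in which each block handles one image pixel and a fixed number of threads cooperate inside each block. The image size and threads-per-block value are illustrative assumptions, not the paper's experimental settings.

```python
# Sketch of the pixel-to-block mapping described above. All sizes here are
# hypothetical examples; the actual best values depend on the platform.

def launch_config(num_rows, num_cols, threads_per_block=64):
    """One 2-D grid of blocks, one block per image pixel.

    grid   -> the whole image (rows x columns of pixels)
    block  -> one pixel's amplitude-estimate calculation
    thread -> one parallel lane inside that pixel's calculation
    """
    grid_dim = (num_rows, num_cols)   # e.g., blockIdx.x -> pixel row, blockIdx.y -> pixel column
    block_dim = (threads_per_block,)  # parallel lanes within one pixel's computation
    total_blocks = num_rows * num_cols
    return grid_dim, block_dim, total_blocks

grid, block, n_blocks = launch_config(512, 256)
```

With this arrangement, enlarging the image only grows the grid, while the per-pixel work assigned to each block stays the same.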

#### 3.2.2 Overall design overview

Referring to the allocation of GPU computing resources illustrated in Fig. 5, the high-level design block diagram of the MV adaptive beamforming algorithm for GPU implementation is demonstrated in Fig. 6. Based on the overall design block diagram, the fine-grained parallelization of the algorithm implementation was conducted on the embedded GPU.

In the design, the MV beamforming implementation took *M* receive channels as its inputs to generate an amplitude estimation of one image pixel. The input data from the ultrasonic echo receive channels was also known as pre-beamform data before the beamforming process, and the output pixel value was also known as post-beamform data after the beamforming process.

Each receive channel streamed input echo samples to the delay calculation block, which output an *M*×1 vector of delayed echo samples. The purpose of the delay calculation block was to align the input ultrasonic echoes based on the delay information among the receive channels. The delayed echo vector must subsequently be multiplied by the adaptive apodization weights. Hence, while the adaptive weight calculation block was calculating the adaptive weights, the echo vector was stored for later use in a data storage buffer built with GPU shared memory. The purpose of the data storage buffer was to let the delayed echo vector wait until the adaptive weight calculation block finished its workload.

Finally, the pixel amplitude estimation block output the final pixel value. It first multiplied the (*M*−*L*+1) segmented delayed echo vectors **echo**_{sub*k*}(*p*_{s}) (*k*=0,1,…,*M*−*L*) by their adaptive weights. The results of these (*M*−*L*+1) pixel value estimates were then averaged to obtain the final pixel amplitude value output.
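A minimal CPU reference sketch of this averaging step (pure Python, with `echo` as the delayed echo vector, `w` as the adaptive weights, and all values real for simplicity) could look like:

```python
# CPU reference sketch of the pixel amplitude estimation block: average the
# (M - L + 1) sub-aperture estimates w^H * echo_sub_k. Names follow the
# paper's notation; this is an illustration, not the GPU kernel itself.

def pixel_amplitude(echo, w, L):
    M = len(echo)
    estimates = []
    for k in range(M - L + 1):              # k = 0, 1, ..., M - L
        sub = echo[k:k + L]                 # segmented delayed echo vector echo_sub_k
        estimates.append(sum(wi * ei for wi, ei in zip(w, sub)))
    return sum(estimates) / len(estimates)  # averaged final pixel value
```

On the GPU, each of the (*M*−*L*+1) inner products can be computed by a different group of threads before a final parallel reduction forms the average.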

#### 3.2.3 Adaptive weight calculation parallelization

In order to implement a highly parallelized MV algorithm, probability theory and linear algebra were used to optimize the detailed implementation and reduce the number of computation operations. The integration of these mathematical results into the implementation is described in the following three parts.

The adaptive weight calculation block in Fig. 6 performs the core computation of the MV adaptive beamforming algorithm. It consists of three major units: covariance matrix calculation, linear equation solver, and final weight calculation. Here, the inner working principles of these blocks will be elaborated.

##### 3.2.3.1 Covariance matrix calculation

Derived from (1), the covariance matrix element Covar_{ij}(*p*_{s}) is calculated as

$$ {\text{Covar}_{ij}(p_{s})=\frac{\sum_{k=0}^{M-L} \text{echo}_{(i+k)}(p_{s}) \text{echo}_{(j+k)}(p_{s})}{M-L+1},} $$

(4)

where echo_{(i+k)}(*p*_{s}) is the (*i*+*k*)th element of the **echo**(*p*_{s}) vector. As the input digital echo data are real numbers, the covariance matrix is a symmetric matrix [15], which has the following property:

$$ \mathbf{Covar}^{T}(p_{s}) = \mathbf{Covar}(p_{s}). $$

(5)

As a result, only the diagonal elements and the lower (**L**) or upper (**U**) triangular elements of the covariance matrix need to be calculated. Therefore, *L*×(*L*+1)/2 element calculations are needed instead of *L*×*L*. Taking advantage of the symmetry makes the covariance matrix implementation nearly twice as fast.
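The symmetric computation of (4) and (5) can be sketched in pure Python as follows (a CPU reference for clarity, not the GPU kernel):

```python
# CPU reference sketch of Eq. (4) exploiting the symmetry of Eq. (5): only
# the diagonal and one triangle are computed; the other triangle is mirrored.

def covariance_matrix(echo, L):
    M = len(echo)
    K = M - L + 1                                # number of sub-apertures, M - L + 1
    C = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(i, L):                    # upper triangle + diagonal only
            s = sum(echo[i + k] * echo[j + k] for k in range(K))
            C[i][j] = s / K
            C[j][i] = C[i][j]                    # mirror: Covar^T = Covar
    return C
```

On the GPU, the *L*×(*L*+1)/2 independent element sums map naturally onto parallel threads within the pixel's compute block.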

##### 3.2.3.2 Linear equation solver

As shown in the weight calculation (2), **Covar**^{−1}(*p*_{s}) has to be calculated. But as also shown in (2), **Covar**^{−1}(*p*_{s}) is always multiplied by the vector *a*. As a result, a linear equation solver which outputs **Covar**^{−1}(*p*_{s})*a* can take the place of the matrix inverse unit and the matrix multiplication unit. The solver is used to solve a system of linear equations of the form:

$$ {\mathbf{Covar}(p_{s})~\boldsymbol{y} = \boldsymbol{a}.} $$

(6)

Therefore, using a system solver saves operation time and storage resources. The covariance matrix is positive-semidefinite and symmetric [15]; hence, Cholesky decomposition [16] is applicable to the weight solver in the regular MV algorithm. Cholesky decomposition is derived from Gaussian elimination, but it halves the decomposition operations and is more stable than **LU** decomposition, which is the matrix form of Gaussian elimination. The **LDL**^{T} form of the Cholesky decomposition was adopted.

Since the weight solver was iterative, the iterations could not be parallelized. But within each iteration, parallelization was achievable. For example, the **L** matrix was formed column by column, only one column per iteration, but the element calculations within each column of **L** could be parallelized.
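A compact CPU reference sketch of the **LDL**^{T}-based solver (pure Python, assuming a symmetric positive-definite input so that no pivoting is needed) might look like:

```python
# Sketch of the LDL^T weight solver: decompose the symmetric covariance
# matrix as A = L * D * L^T, then solve A y = a without ever forming A^{-1}.
# L is built column by column, as described above.

def ldlt_solve(A, a):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    D = [0.0] * n
    for j in range(n):                           # one column of L per (sequential) iteration
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        L[j][j] = 1.0
        for i in range(j + 1, n):                # elements within a column: parallelizable
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] * D[k] for k in range(j))) / D[j]
    # Forward substitution: L z = a
    z = [0.0] * n
    for i in range(n):
        z[i] = a[i] - sum(L[i][k] * z[k] for k in range(i))
    # Diagonal scaling: D v = z
    v = [z[i] / D[i] for i in range(n)]
    # Back substitution: L^T y = v
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = v[i] - sum(L[k][i] * y[k] for k in range(i + 1, n))
    return y
```

The outer loop over columns is inherently sequential, while the inner loops over the elements of one column (and of one substitution step) are the parts that can run in parallel threads.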

##### 3.2.3.3 Final weight calculation

The final step of the weight calculation is to calculate

$$ {\boldsymbol{w}(p_{s})=\frac{\boldsymbol{y}}{\boldsymbol{a}^{H} \boldsymbol{y}}.} $$

(7)

As *a* is a vector of ones, *a*^{H}*y* can be calculated as

$$ {\boldsymbol{a}^{H} \boldsymbol{y} = \sum_{n=1}^{L} y_{n}.} $$

(8)

Therefore,

$$ {\boldsymbol{w}(p_{s})=\frac{\boldsymbol{y}}{\sum_{n=1}^{L} y_{n}}.} $$

(9)

As a result, the final weight calculation step can be parallelized.
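The normalization in (9) reduces to an elementwise division by a running sum, as the following pure-Python sketch shows:

```python
# Sketch of Eq. (9): with a being a vector of ones, a^H y reduces to the
# plain sum of Eq. (8), so each weight is just y_n divided by that sum.

def final_weights(y):
    s = sum(y)                   # a^H y, Eq. (8): a parallel reduction on the GPU
    return [yn / s for yn in y]  # elementwise division: one thread per weight
```

The sum is a standard parallel reduction, after which every element of *y* can be scaled by an independent thread.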

#### 3.2.4 Assignment of GPU memory access

The memory access strategy of a GPU implementation has an important impact on the overall GPU computational speed. There are basically three types of memory modules inside the GPU, i.e., global memory, shared memory, and register files. The lifetime of the data in each memory type is associated with a level of the GPU computing hierarchy: data in the global memory live as long as the GPU compute grid, data in the shared memory live as long as their GPU compute block, and data in the register files live as long as their GPU compute thread.

These three types of memory modules are fabricated at different architectural hierarchy levels. The register files reside right next to the GPU processing cores, the shared memory is located farther from the GPU processing cores, and the global memory is located farthest from them. The distance to the GPU cores determines the access speed of each memory type. As a result, the register files are the fastest, the shared memory is slower than the register files, and the global memory is the slowest of the three. On the other hand, the size of a memory type is inversely related to its access speed: the global memory is the largest, the register files are the smallest, and the shared memory lies in between. Therefore, in a GPU program memory assignment, small, frequently used variables can be stored in the register files, while large amounts of data should be stored in the global memory. Furthermore, the shared memory is usually used as the shared data space for computation among the GPU compute threads within a specific GPU compute block.

The GPU memory access assignment of the implementation design is illustrated in Fig. 7. As shown in Fig. 7, the use of the slowest global memory was restricted to storing the input and output data of the beamforming process, i.e., the pre-beamform data and the post-beamform data. As a result, communications between the ARM processor and the GPU happen only at the beginning and the end of the MV beamforming algorithm. The faster shared memory was used to store intermediate results during the beamforming process, and the fastest register files were assigned to hold temporary results within the beamforming steps.