 Research
 Open Access
 Published:
Real time simultaneous localization and mapping: towards lowcost multiprocessor embedded systems
EURASIP Journal on Embedded Systemsvolume 2012, Article number: 5 (2012)
Abstract
Simultaneous localization and mapping (SLAM) is widely used by autonomous robots operating in unknown environments. Research community has developed numerous SLAM algorithms in the last 10 years. Several works have presented many algorithms’ optimizations. However, they have not explored a system optimization from the system hardware architecture to the algorithmic development level. New computing technologies (SIMD coprocessors, DSP, multicores) can greatly accelerate the system processing but require rethinking the algorithm implementation. This article presents an efficient implementation of the EKFSLAM algorithm on a multiprocessor architecture. The algorithmarchitecture adequacy aims to optimize the implementation of the SLAM algorithm on a lowcost and heterogeneous architecture (implementing an ARM processor with SIMD coprocessor and a DSP core). Experiments were conducted with an instrumented platform. Results aim to demonstrate that an optimized implementation of the algorithm, resulting from an optimization methodology, can help to design embedded systems implementing lowcost multiprocessor architecture operating under realtime constraints.
Introduction
Autonomous robots must be able to localize themselves. Simultaneous localization and mapping (SLAM) algorithms aim to build an environment map while estimating the robot pose. Many researches were conducted to develop SLAM algorithms like extended Kalman filter for SLAM (EKFSLAM)[1, 2], FAST SLAM[3], GRAPH SLAM[4], DPSLAM[5] which aim to improve consistency, accuracy or robustness. Other algorithms derivate from the EKFSLAM, such as algorithms using unscented Kalman filter (UKF)[6] which increases the localization accuracy against the classical EKF algorithm based on a linearized model. Only few works deal with the implementation of lowcost SLAM embedded systems.
Most of SLAM implementations rely on the use of accurate and dense measurements provided by expensive sensors like laser rangefinder sensors[7] or time of flight cameras[8]. Highpriced smart sensors are not suitable to be integrated in most of embedded systems in commercial objectives or industrial applications.
Simultaneous localization and mapping systems using lowcost sensors have been recently designed. Abrate et al.[9] provide an implementation of the EKFSLAM algorithm on a Khepera robot. The robot hosts limited range, sparse and noisy IR sensors. Experimental results have shown the importance of the sensor characteristics, the primitives (lines) extraction and data association. Yap and Shelton[10] use cheap, noisy and sparse sonar sensors embedded in a P3DX robot. To cope with these lowcost sensors, the implemented SLAM algorithm uses a multiscan approach and an orthogonality assumption to map indoor environments.
Classical SLAM algorithms are too computationally intensive to run on an embedded computing unit. They require at least laptoplevel performances. Gifford et al.[11] present a lowcost approach to autonomous multirobot mapping and exploration for unstructured environments. The robot hosts a Gumstix computing unit (600 Mhz), 6 IR scanning range arrays, a 3axis gyroscope and odometers. Running DPSLAM alone on the Gumstix with 15 particles takes on average 3 s per update. While using 25 particles, it takes more than 10 s per update. Authors have underlined the difficulty to find the right SLAM parameters to fit within the available computing power and the realtime processing. Magnenat et al.[12] present a system based on the codesign of a lowcost sensor (a slim rotating scanner), a SLAM algorithm, a computing unit, and an optimization methodology. The computing unit is based on an ARM processor (533 Mhz) running a FASTSLAM 2.0 algorithm[13]. Magnenat et al.[12] use an evolution strategy to find the best configuration of the algorithm and setting of the parameters.
As pointed out by[11, 12], the first improvement of a SLAM algorithm is an efficient setting of the various parameters of the algorithm. Other modifications were investigated to reach realtime constraints. These modifications are necessary due to the low computing power and limited memory resources available on embedded systems. Features restriction for EKFSLAM algorithm has been implemented to decrease the processing time[14]. Schroter et al.[15] focused on reducing the memory footprint of particlebased gridmap SLAM by sharing the map between several particles.
Robust laserbased SLAM navigation has long existed in robot applications, but systems implement sensors that, in some cases, are more expensive than the final product. Neato Robotics has developed a vacuum cleaner that implements a navigation system using a SLAM algorithm. The approach is based on a lowcost system implementing a designed laser rangefinder[16].
This article presents an efficient implementation of the EKFSLAM algorithm on a multiprocessor architecture. The approach is based on an algorithm implementation adequate to a defined architecture. The aim is to optimize the implementation of the SLAM algorithm on a lowcost and heterogeneous architecture implementing an SIMD coprocessor (NEON) and a DSP core. The hardware includes several lowcost sensors. As[17], we chose to use a lowcost camera (exteroceptive sensor) and odometers (proprioceptive sensors). Following[12], we efficiently tune the parameters of the SLAM algorithm. We improve on previous works by proposing an adequate implementation of the EKFSLAM algorithm on a multiprocessing architecture (ARM processor, SIMD NEON coprocessor, DSP core). The specifications related to the NEON coprocessor and the DSP core improve the processing time and the system performance. Results aim to demonstrate that an optimized implementation of the algorithm, resulting from an evaluation methodology, can help to design embedded systems implementing lowcost multiprocessor architecture operating under realtime constraints.
Section “EKFSLAM algorithm” introduces the EKFSLAM algorithm. Section “Multiprocessor architectureand system configuration” presents the embedded multiprocessor architecture and the system configuration. Section “Evaluation methodology and algorithm implementation” details the evaluation methodology, provides a first algorithm implementation and analyzes this implementation in terms of processing time. A Hardware–software optimization is proposed and analyzed in Section “Hardware–software optimization and improvements”. It presents SIMD optimizations and DSP parallelization. A performance comparison is then performed between the optimized and nonoptimized instances. Finally, Section “Conclusion” concludes this article.
EKFSLAM algorithm
Overview
Extended Kalman filter for SLAM estimates a state vector containing both the robot pose and the landmark locations. We consider that the robot is moving on a plane. The algorithm uses 3D points as landmarks. It uses proprioceptive sensors to compute a predicted vector and then corrects this state using exteroceptive sensors. In this article, we consider a wheeled robot embedding two odometers (attached to each rear wheel) and a camera.
State vector and covariance matrix
With N landmarks, the state vector is defined as:
where:

x,z are the ground coordinates (xaxis, zaxis) of the robot rear axle center. We suppose that the robot is always moving on the ground, so y=0 (no elevation) and y does not appear in Equation (1).

θ is the orientation of a local frame attached to the robot with respect to the global frame.

${x}_{{a}_{1}},{y}_{{a}_{1}},{z}_{{a}_{1}},\dots ,{x}_{{a}_{N}},{y}_{{a}_{N}},{z}_{{a}_{N}}$ are the 3D coordinates of the N landmarks in the global frame.
The state covariance matrix is defined as:
Prediction
The prediction step relies on the measurements of the proprioceptive sensors, the odometers, embedded on our experimental platform. A non linear discretetime statespace model is considered to describe the evolution of the robot configuration x:
where u_{ k }is a known twodimensional control vector, assumed constant between the times indexed by k−1 and k, and v_{ k }is an unknown state perturbation vector that accounts for the model uncertainties. x_{k−1k−1} represents the state vector at time k1, x_{kk−1} represented the state vector after the prediction step, x_{kk}represents the state vector after the estimation step. The classical evolution model, described in[18], is considered:
where u_{ k }=(δs δθ); δs is the longitudinal motion and δθ is the rotational motion[19]:
where:

w_{ r }and w_{ l }are respectively the radius of the right and left wheel.

e is the length of the rear axle.

$\delta {\phi}_{i}=\delta {p}_{i}\frac{2\Pi}{\rho}$ with i∈{r,l} (r=right, l=left), δ p_{ i }: number of steps, ρ: odometer resolution. δ φ_{ i }is the angular movement of the right/left wheel.
The state covariance matrix is defined as:
where

$\frac{\partial \mathbf{f}}{\partial \mathbf{x}}=\left[\begin{array}{llllll}\hfill 1\hfill & \hfill 0\hfill & \hfill \mathrm{\delta s}sin\left({\theta}_{k1k1}+\frac{\mathrm{\delta \theta}}{2}\right)\hfill & \hfill 0\hfill & \hfill .\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 1\hfill & \hfill \mathrm{\delta s}cos\left({\theta}_{k1k1}+\frac{\mathrm{\delta \theta}}{2}\right)\hfill & \hfill 0\hfill & \hfill .\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 0\hfill & \hfill 1\hfill & \hfill 0\hfill & \hfill .\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 1\hfill & \hfill .\hfill & \hfill 0\hfill \\ \hfill .\hfill & \hfill .\hfill & \hfill .\hfill & \hfill .\hfill & \hfill .\hfill & \hfill .\hfill \\ \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill 0\hfill & \hfill .\hfill & \hfill 1\hfill \end{array}\right]$

Q_{ k }is the covariance matrix of the process noise.
Estimation
The estimation of the state is made using the camera which returns the position in the image (u_{ i }, v_{ i }) of the ith landmark.
The innovation and its covariance matrix: The pinhole model is used to project a known landmark position into the image:
where:

(u_{ i },v_{ i }) is the position of the ith landmark in the image.

$({x}_{{a}_{i}}^{\text{cam}},{y}_{{a}_{i}}^{\text{cam}},{z}_{{a}_{i}}^{\text{cam}})$ is the position of the ith landmark in the camera frame.

f is the focal length.

(k_{ u },k_{ v }) is the number of pixels per unit length.

s_{ uv }is a factor accounting for the skew due to nonrectangular pixels. In our case, we take s_{ uv }=0.
Equation (7) can be written as the predicted observation equation for a single landmark:
The pose of a landmark in the camera frame is defined from its pose (${x}_{{a}_{i}}$,${y}_{{a}_{i}}$,${z}_{{a}_{i}}$) in the global frame:
Where D is the length between the camera and the robot rear axle center.
During the observation step, the algorithm matches M landmarks (M<=N) whose observations are added in${\mathbf{h}}_{k}=\left(\begin{array}{l}\hfill {\mathbf{h}}^{0}\hfill \\ \hfill .\hfill \\ \hfill {\mathbf{h}}^{M1}\hfill \end{array}\right)$.
Thus, the innovation is:
where${\widehat{\mathbf{z}}}_{k}$ is the measurement for all the M predicted observations.
The innovation covariance S_{ k }is:
where H_{ k } is the Jacobian of h_{ k }and R_{ k } is the observation noise covariance.
State estimation: The state is updated using the classical EKF equations:
Visual landmarks
The landmarks used in the observation equation are extracted from images. Landmark initialization defines the initial coordinates and the initial covariance of landmarks localization (also called interest points or features). In[20], we have evaluated the processing time of corner detectors like Harris, ShiTomasi or FAST. Harris and ShiTomasi detectors were more time consuming than the FAST detector and do not provide significantly better localization results than FAST. Consequently, there is no need to implement more sophisticated algorithms such as Harris or Shi and Tomasi. FAST[21] (Features from Accelerated Segment Test) corner detector relies on a simple test performed for a pixel p by examining a circle of 16 pixels (a Bresenham circle of radius 3) centered on p. A feature is detected at p if the intensities of at least 12 contiguous pixels are all above or all below the intensity of p with a threshold t. Even if this detector is not highly robust to noises and depends on a threshold it produces stable landmarks and is computationally very efficient[21].
The FAST detector[21] is related to the wedgemodel style of detector evaluated using a circle surrounding a candidate pixel. To optimize the detector processingtime, this model is used to made a decision classifier which is applied to the image (Figure1).
Matching based on zeromean sum of squared differences
The EKFSLAM matches the previously detected feature with a new one using zeromean sum of squared differences (ZMSSD).
The covariance of the projected feature localization defines a searching area τ. This area includes the robot localization uncertainty and the landmarks localization uncertainty. We use the ZMSSD to find the best candidate point inside τ. For each candidate point p:(p_{ x },p_{ y }), the N_{ p }value of the weighted ZMSSD is:
and
where:

w(p_{ x },p_{ y }) is the Gaussian weights defined by the landmark covariance.

i∈[0;des−1] and j∈[0;des−1] and des is the descriptor size.

d is the feature descriptor.

m_{ d }and m_{ i }are respectively the means of the pixel values in the descriptor and in the image window.

im is the image.
The observation p_{obs} will be selected using$p=(p\in \tau {N}_{p}=min\left({N}_{{p}_{j}}\right),\forall {p}_{j}\in \tau )$.
The descriptor, used to identify the landmark during the matching, is classically a small image window of 9×9 pixels to 16×16 pixels around the interest point. Davison[22] claims that this sort of descriptor is able to serve as longterm landmark feature.
Landmark initialization based on davison method
Landmark initialization consists of defining the initial coordinates and the initial covariance of landmarks (interest points). Various methods exist and can be classified as an undelayed or delayed method. Undelayed method adds landmarks with only one measurement whereas the delayed method needs two or more frames. We chose to use the widely spread delayed method proposed by[2] which is both efficient and adequate to implement. Furthermore the work of Munguia and Grau[23] shows that the delayed method have the same performance as the undelayed method.
In order to compute the 3D depth of a newly detected landmark, as[2], we initialize a 3D line into the map along which the landmark must lie. This line starts at the estimated camera position and heads to infinity along the feature viewing direction. The line is composed of 100 particles which represent depth hypothesis. The prior probability used is uniform and the range is 0.5 to 15 m. At subsequent time, each particle (a feature depth hypothesis) is projected into the image, matched and its probability is reweighted.
When the ratio of the standard deviation of depth to the expected value is below a threshold, the distribution is approximated as a Gaussian and the landmark is initialized. The landmark pose A_{ i }=(x_{ a i },y_{ a i },z_{ a i }) is added to x and the A_{ i } covariance is added into P.
Multiprocessor architecture and system configuration
In order to test and validate the EKFSLAM algorithm, experiments were conducted with an instrumented mobile robot called Minitruck[24]. The platform was teleoperated during the experiments. For our first evaluation, the experiment consists to operate inside a large corridor of our research lab (see Figure2).
We have developed a system architecture on the top of a multiprocessor board (Gumstix Overo) based on the OMAP3530 chip (see Figure3). The OMAP chip integrate a RISC processors (ARM Cortex A8 500 Mhz) with an SIMD NEON coprocessor, a DSP (TMS320C64x + 430 Mhz) and a graphical processor unit (POWERVR SGX). This board communicates with an additional processor for control and data acquisition (Atmega168 16 Mhz).
Multiple sensors (odometers and a camera) are interfaced to this architecture (Figure2). The variety of sensors enables us to evaluate the SLAM algorithms with different types of sensor data and take advantage of the information complementary of these sensors. Our objective is to evaluate the implementation of SLAM algorithms using land vehicles and sensors, like steering encoders and a camera.
The use of wheel and steer encoders is obvious in robotics and navigation. Simple kinematic motion models can be used to integrate velocity and heading measurements from wheel and steer encoders to provide an estimation of the mobile robot location and orientation. Estimations are regularly subject to considerable errors due to misalignment, offsets and wheels slippage. It is possible to implement basic models to approximate and correct offset and slippage errors online leading to significant improvement of performances. We chose two HEDS 5540 odometers for our experimental vehicle.
The feature detection in SLAM application relies on the embedded sensors. We chose to achieve this extraction using a vision sensor (a cheap USB webcam, Philips SPC530NC, delivering 30 fps). We chose to use all possible images (30 fps) because it is much easier to perform point matching if the movement is small. Conventional approaches for vision systems design are usually based on general purpose computers interfaced with cameras. The new computing technologies (SIMD, DSP, multicores) can greatly accelerate algorithm processing, but require rethinking these algorithms by optimizing the parallelism. This parallel processing is pushed to integrate near the sensors parallel computing units[25]. We have used a Gumstix processing module based on OMAP3530 architecture. It is an heterogeneous architecture (ARM CortexA8 500 Mhz processor with power consumption less than 300 mW, SIMD NEON integrated coprocessor, DSP C64x processor and a 3D graphics accelerator) that communicates via a WLAN connection (802.11 g).
The WLAN connection is used only to control speed and direction of the vehicle. In the future, a dedicated algorithm to autonomous navigation will be implemented and thus the WLAN connection will be used to achieve only the system monitoring. A coprocessor (ATMega168) takes care of data acquisition. It controls the robot speed and its direction using two pulsewidth modulation (PWM) signals. It decodes signals coming from the odometers embedded in the rear wheels. It communicates with the main board using an I2C interface. This interface allows the main processor to retrieve odometers data and to send instructions corresponding to speed and direction.
To evaluate the designed system, an experiment was achieved in a corridor of our lab. Frames have been grabbed at 30 fps with 320×240 resolution. Odometer data were sampled at 30 Hz. During the experiment, references are periodically drawn on the ground by an embedded marker.
Evaluation methodology and algorithm implementation
Our evaluation methodology is based on the identification of the processing tasks requiring a significant computing time. It is based on several steps: we analyze first the execution time of tasks and their dependencies on the algorithm’s parameters. A threshold is fixed for each parameter. The algorithm is then partitioned in order to have functional blocks (FBs) performing defined calculations. Each block is then evaluated to determine its processing time. Function blocks that require the most important execution time are then optimized to reduce the global processing time.
Algorithm Algorithm 1 EKFSLAM summarizes the main tasks of EKFSLAM. The algorithm is composed of two process: Prediction and Correction. The correction process implements three tasks: matching, estimation and initialization.
Algorithm 1 EKFSLAM
1: χ←⊘ ⊳ List of Landmarks for initialization
2: Robot pose initialization
3: while localization is required do
4: DATA ← Sensors Data acquisition
5: if DATA =(φ_{ l },φ_{ r })then ⊳ Odometer’s data
6: Prediction
7: (δs, δθ) ← g(φ_{ l },φ_{ r }) (see Eq (5))
8:${\mathbf{x}}_{kk1}\leftarrow \mathbf{f}({\mathbf{x}}_{k1k1},\mathrm{\delta s},\mathrm{\delta \theta})$ (see Eq (4))
9:${\mathbf{P}}_{kk1}\leftarrow \frac{\partial \mathbf{f}}{\partial \mathbf{x}}{\mathbf{P}}_{k1k1}{\frac{\partial \mathbf{f}}{\partial \mathbf{x}}}^{T}+{\mathbf{Q}}_{k}$ (see Eq (6))
10: else if DATA = Camera then
11: FAST detector applied on the image
12: Matching:
13: for Each Landmark N_{ i }∈x_{kk−1}do
14:${u}_{i},{v}_{i},{\tau}_{i}\leftarrow \text{pinhole}({\mathbf{x}}_{kk1},{N}_{i})$ (see Eq (8))
15: if(u_{ i },v_{ i })∈ Camera Frame then
16:${\widehat{\mathbf{z}}}_{k}\leftarrow \text{ZMSSD}({\tau}_{i},{N}_{i})$ (see Eq (13))
17:${\mathbf{h}}_{k}\leftarrow ({u}_{i},{v}_{i})$
18:${\mathbf{Y}}_{k}\leftarrow {\widehat{\mathbf{z}}}_{k}{\mathbf{h}}_{k}$
19:${\left(\right)close="">{\mathbf{H}}_{k}\leftarrow \frac{\partial {\mathbf{h}}_{k}}{\partial \mathbf{x}}}_{}{\mathbf{x}}_{kk1}$
20: end if
21: end for
22: Estimation:
23:${\mathbf{S}}_{k}\leftarrow {\mathbf{H}}_{k}{\mathbf{P}}_{kk1}{\mathbf{H}}_{k}^{T}+{\mathbf{R}}_{k}$ (see Eq (11))
24:${\mathbf{K}}_{k}\leftarrow {\mathbf{P}}_{kk1}{\mathbf{H}}_{k}{\mathbf{S}}_{k}^{1}$ (see Eq (11))
25:${\mathbf{x}}_{kk}\leftarrow {\mathbf{x}}_{kk1}+{\mathbf{K}}_{k}{\mathbf{Y}}_{k}$ (see Eq (12))
26:${\mathbf{P}}_{kk}\leftarrow (\mathbf{I}{\mathbf{K}}_{k}{\mathbf{H}}_{k}){\mathbf{P}}_{kk1}$ (see Eq (12))
27: Initialization:
28: for Each L ∈ χ do ⊳ L: Aspiring new Landmark
29: L_{obs} ← ZMSSD(L) (see Eq (13))
30: Update the particles weight according L_{obs}(see[2])
31: Compute σ_{depth}, depth
32: if$\frac{{\sigma}_{\text{depth}}}{\text{depth}}$ < ε then
33: Compute L, P_{ L }
34: append(x_{kk−1},L); append(P_{kk−1},P_{ L })
35: remove(χ,L)
36: end if
37: end for
38: if Lack of Landmark then ⊳ see[8]
39: append(χ, New_Landmarks)
40: end if
41: end if
42: end while
Prediction process
This phase updates the mobile robot position (x_{kk−1}) according to its proprioceptive data acquired from odometers (φ_{ l },φ_{ r }). The processing time of the prediction process is constant. It just updates the 3D vector containing the robot pose and its 3×3 covariance matrix. During the prediction step, the landmarks localization and uncertainties do not change: landmarks are defined in the global frame.
Correction process
The processing time of the correction process is not constant. The following of this section studies the processing time of each task of the process and their dependencies.
Matching task
Each landmark in the state vector must be projected in the camera frame using the pinhole model (see L. 2). The computing time of these projections depends only on the number of landmarks in the state vector (L. 2). For each projected landmark on the focal plane, ZMSSD matches an observation. Both the size of the descriptor and the size of the searching area τ will affect the computing time (see Equation (13)).
The processing time of the matching task depends on several parameters:

The number of landmarks in the state vector.

The number of visible landmarks on the focal plane.

The size of the descriptor.

Both the localization uncertainty of the mobile robot and the landmarks.
In practice, all the previously defined parameters should be set in order to bound the computing time. The first three parameters can be set by the users. The uncertainty depends on the followed path and cannot be bounded.
Estimation task
The estimation task uses the classical Kalman equations to update both the robot and landmarks uncertainties. The processing time of the estimation task is timeconsuming and depends on:

The number of landmarks in the state vector.

The number of landmarks observed.
The size of the matrix and thus the computing cost of the matrix multiplication in the Equations (11) and (12) depend on the number of landmarks in the state vector. Moreover, Equation (11) depends on the number of landmarks observed. As for the matching process, these parameters (size of the state vector and number of observations) should be bounded in order to achieve this estimation task in a constant computing time.
Initialization task
For each landmark under initialization, each particle (a feature depth hypothesis) is projected into the image, matched and its probability is reweighted. If there is a lack of landmarks under initialization, we add aspiring new landmarks. The processing time of the initialization task depends on:

The number of landmarks being initialized.

The size of the descriptor.

Both the localization uncertainty of the mobile robot and the landmark.
The number of landmarks being initialized and the size of the descriptors can be bounded. For each landmark being initialized, we have to update the probability of each localization hypothesis using a matching process. As for the matching task, the computing time depends on the localization uncertainty of the mobile robot and the landmarks.
Thresholds definition
Previous section shows that the computation time of each task of the EKFSLAM algorithm depends on many variables. For realtime implementation, it is important to get a constant, or at least a bounded computation time. To solve this constraint we have to:

set the maximum number of landmarks in the state vector. The size of the state vector will be fixed. Therefore, no dynamic memory allocation will be needed.

set the maximum number of landmarks observed. This keeps the computation time of the estimation task constant using a fixed size matrix multiplication.

set the maximum number of landmarks being initialized in order to bound the computation time of the initialization task. Unfortunately, it will not be sufficient to keep the computation time of the initialization task constant due to its internal matching step.

bound the computing time induced by the uncertainties. The only solution to get a bounded globalprocessingtime is to set a maximum execution time for the matching task. Due to the constant processing time of the prediction and the estimation task, the execution time of both the matching and initialization task can be bounded (33 ms  (t_{prediction}+ t_{estimation})). We chose to use all possible images (30 fps).We set a maximum execution time for the matching task. The algorithm proceeds in a way to match a maximum of landmarks in a bounded time. The initialization task has a dynamic execution time depending on the real processing time of the matching task and the number of landmarks being initialized. The lower bound of this dynamic execution allows at least a minimum number of landmarks to be initialized.
Map management
To keep the size of the state vector constant, we need to delete some landmarks when inserting new ones. The new state vector includes new landmarks (whose initialization has just been performed) and previously used landmarks. Auat Cheein and Carelli[26] proposes an efficient method to select landmarks for the estimation task. It is based on the evaluation of the influence of a given feature on the convergence of the state covariance matrix. The method matches all possible landmarks and computes (I−K_{ k }H_{ k }) from Equation (12). Unfortunately, we cannot implement it exactly as proposed by[26] due to the high computing time. We chose to add the landmarks, based on the previous estimation step, by selecting the previous landmarks which have the best previous influence on the convergence of the state covariance matrix. At time k, we select the landmark which had the smallest (I−K_{k−1}H_{k−1}).
Functional block partitioning
All the previously defined tasks do not have a fixed computing time, their computing time depends on the experiment. We have defined FBs which have a fixed computing times to optimize the implementation. The computing time of the FB do not depend on the experiment. Experiments will only affect the number of iterations of some FBs(3,4,5,6,7,8 and 9). From the previous algorithm, we have defined 9 FBs and their runtimes are studied in below Table1.
Each FB has a fixed computing time and some FB can occur more than one time (Landmark projection, ZMSSD, H_{ i }, Weight updating, Addition of a new landmark).
Processing time evaluation
As an application scenario, the robot moves over a square of 6 m side. At the end of the trajectory, it joined the initial starting position. Using only odometers, the final localization has an error of 1.6 m. With the EKFSLAM algorithm, the localization has been significantly improved. The final error is approximately 0.4 m. EKFSLAM includes all viewed landmarks in the state vector. Indeed, the localization result depends on the number of landmarks but the size of the state vector and the number of observations must be bounded to achieve a bounded computing time. The overall accuracy of the EKFSLAM depends on the number of the landmarks in the state vector and the matched observations. The accuracy of the localization depends monotonically on the number of processed landmarks.
The given EKFSLAM (Algorithm Algorithm 1 EKFSLAM) is processed sequentially on the embedded ARM processor operating at 500 MHz (no coprocessor is implemented). In the following, all times given correspond to times evaluated on the embedded system using the ARM processor. The data acquisition time is constant:

The odometer data acquisition is achieved in 0.7 ms (this processing time is due to the I2C communication with the Atmega168 processor).

Each image acquisition takes 1.8 ms (due to USB data transfer).
The prediction step does not require significant processing time, it takes only 0.093 ms per iteration. As for the matching task, the estimation task cannot be achieved in a constant processing time. Estimation task processing time depends on the total number of landmarks and the number of matched landmarks. Figure4 shows the processing time of the estimation task according to the number of landmarks in the state vector. The estimation task is entirely processed on the ARM processor (no use of coprocessor). Obviously it will be impossible to take into account all the landmarks detected when the algorithm is processed: the computation time will be higher than the 33 ms allowed. It is necessary to find a compromise between the number of landmarks and the processing time.
Experimental results
An experiment was conducted to evaluate the processing time of the different blocks of the algorithm (including tasks with unboundable processing time). For this experiment, we set the size of the descriptor to 16×16 pixels and we set the thresholds as follows:

Maximum number of landmarks in the state vector: 25.

Maximum number of observed landmarks: 25.

Maximum number of landmarks being initialized: 20.
First, we can analyze the runtime of the 8 previously defined FBs of the algorithm. We have used the integrated cycle counter register (CCNT) of the ARM processor to compute the processing time of each FB. The prediction process (FB1) occurs in 0.093 ms. Table2 summarizes, for the other FBs, the processing time per iteration, the mean of the number of iterations and the mean of the processing time per correction process. The estimation task could not be processed in some iterations of the correction process, especially when there is no matched landmark.
The mean processing time by frame is approximately 80.8 ms which corresponds to the sum of all processing times: prediction process (FB1) and correction process (FB2 to FB9). The processing time of the estimation task (FB6) is approximately 70.5 ms and it represents about 87% of the global processing time. The FAST detector (FB2) represents 3.4 ms. The ZMSSDM task (FB4) takes 2.63 ms per correction process. Finally, the initialization task (FB7, FB8 and FB9) takes 3.9 ms. These six FBs represent 99.6% of the global processing time. We focused on an efficient implementation of these FBs to enhance the global processing time.
Hardware–software optimization and improvements
OMAP3530 architecture description
The OMAP3530 is an heterogeneous architecture designed by TI (Texas Instruments) and implements an ARM CortexA8 500 MHz processor, a NEON coprocessor with SIMD instructions, a DSP C64x processor and a 3D graphics accelerator.
The NEON unit is similar to the MMX and SSE extensions existing on an X86 processor. It is optimized for Single Instruction Multiple Data (SIMD) operations. The NEON unit has two floating point pipelines, an integer pipeline and a 128 bits load/store/permute pipeline. An efficient implementation on the SIMD NEON architecture improves the processing time. NEON instructions perform “Packed SIMD” processing as follows:

Registers are considered as vectors of the same data type elements

Data types can be: signed/unsigned 8, 16, 32, 64bits or single precision floating point

Instructions perform the same operation on multiple data simultaneously as shown in Figure5. The number of simultaneous operations depends on the data type: NEON supports up to 16 operations at the same time using 8bits data.
SIMD optimization results
In the Algorithm Algorithm 1 EKFSLAM, the timeconsuming FBs are: the estimation block (FB6), the initialization blocks (FB7, FB8 and FB9), the FAST detector block (FB2) and the ZMSSDM block (FB4). FAST detector is already an optimized instance using machine learning[21]. Moreover, FAST has been already implemented on an FPGA based architecture[27]. We chose to optimize the other FBs. The matching task computes ZMSSD which computes the image correlation. It performs the same operation (addition, subtraction, multiplication and comparison) on 8 bits data. The computation of the ZMSSD can be optimized using the SIMD NEON architecture. The estimation task is based on floating point matrix multiplication, it could efficiently be optimized using the SIMD NEON coprocessor (the ARM Cortex A8 does not include any floating point unit (FPU)). The initialization FBs will be studied at Section “Parallel implementation on a DSPprocessor”.
ZMSSD (FB4)
The EKFSLAM matches features using ZMSSD. ZMSSD is computed for each landmark using Equation 2. We chose to use a descriptor with 16×16 8bits pixels size due to the efficiency of SIMD NEON architecture to deal with 128/64 bits vectors.
Basic implementation
The basic implementation of the ZMSSD function block computes the means of the pixel values in a window m_{ i }(m_{ d } can be precalculated when the landmark is detected). Then the ZMSSD (ZMSSD) is computed using loops (Algorithm Algorithm 2 Basic ZMSSDł::bel alg:BasicZMSSD).
Algorithm 2 Basic ZMSSD
1: m_{ i }←0
2: ZMSSD←0
3: for Each i ∈[0;des−1], j∈[0;des−1]do
4:${m}_{i}\leftarrow {m}_{i}+\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$
5: end for
6:${m}_{i}\leftarrow {m}_{i}/(\text{des}\times \text{des})$
7: for Each i ∈[0;des−1], j ∈[0;des−1]do
8:$\text{ZMSSD}\leftarrow \text{ZMSSD}+\sum _{i,j}{\left(\right(d(i,j){m}_{d})(\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2}){m}_{i}\left)\right)}^{2}$
9: end for
This implementation takes 12.60 μ s on the ARM processor.
Efficient scalar implementation
The second implementation aims to modify the calculation of ZMSSD in order to avoid the use of two loops. Formally the ZMSSD is written as:
where d=d(i,j) and$\text{im}=\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$
By expanding the ZMSSD, we obtain:
Using${m}_{d}=\sum _{\mathrm{kl}}\frac{d(k,l)}{\text{des}\times \text{des}}$ and${m}_{i}=\sum _{\mathrm{kl}}\frac{\text{im}({p}_{x}+k\frac{\text{des}}{2},{p}_{y}+l\frac{\text{des}}{2})}{\text{des}\times \text{des}}$, we simplify the sum:
The equation becomes:
Using the notation:

$\text{Sd}=\sum _{i,j}d(i,j)$ the sum of the descriptor pixels (this sum can be precalculated).

$\text{Si}=\sum _{i,j}\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$ the sum of the image pixels.

$\text{SSi}=\sum _{i,j}\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})\times \text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$ the sum of squared image pixel values.

$\text{SSd}=\sum _{i,j}d(i,j)\times d(i,j)$ the sum of the squared descriptor pixel values (this sum can be precalculated).

$\text{Sdi}=\sum _{i,j}d(i,j)\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$ the sum of the product of the descriptor pixels and the image pixels.
The final equation is:
The implementation of Algorithm Algorithm 2 Basic ZMSSDł::bel alg:BasicZMSSD becomes Algorithm Algorithm 3 Efficient scalar ZMSSD:
Algorithm 3 Efficient scalar ZMSSD
1: Si←0
2: SSi←0
3: Sdi←0
4: for Each i∈[0;des−1], j∈[0;des−1]do
5:$\mathrm{Si}\leftarrow \mathrm{Si}+\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$
6:$\mathrm{SSi}\leftarrow \mathrm{SSi}+\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$$\phantom{\rule{4.0pt}{0ex}}\times \text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})$
7:$\mathrm{Sdi}\leftarrow \mathrm{Sdi}+\text{im}({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2})\times d(i,j)$
8: end for
9:$\text{ZMSSD}\leftarrow \left(\right((2\mathrm{Sd}\times \mathrm{Si})S{d}^{2}S{i}^{2})/(\text{des}\times \text{des}\left)\right)+\mathrm{SSi}+\mathrm{SSd}2\mathrm{Sdi}$
In this instance, we use only one loop. This reduces memory access. Using this implementation, the computing time decrease from 12.60μ s to 11.29μ s.
Vector implementation
SIMD NEON architecture allows vector processing and performs the same operation on all the vector processingunits. We have implemented a vectorized instance of the ZMSSD functional block as follows (Algorithm Algorithm 4 SIMD vectorized ZMSSD):
Algorithm 4 SIMD vectorized ZMSSD
1: V 8x 8V image←0 ⊳ V8x8: 8×8 bits vector
2: V 8x 8 V descriptor ← 0
3: V 16x 8 VSi←0 ⊳ V16x8: 8×16 bits vector
4: V 32x 4 VSSi←0 ⊳ V32x4: 4×32 bits vector
5: V 32x 4 VSdi←0
6: for Each i∈[0;des−1], j=0,8do
7:$V\text{image}\leftarrow {\text{load}}_{8}\left(\mathrm{im}\right({p}_{x}+i\frac{\text{des}}{2},{p}_{y}+j\frac{\text{des}}{2}\left)\right)$⊳ load 8 pixels
8: V descriptor←load_{8}(d(i,j))
9: VSi ← VSi + V image
10: VSSi ← VSSi + V image×V image
11: VSSi ← VSSi + V image×V image
12: end for
13: Si ← sum(VSi)⊳ Sums the component of a vectors
14: SSi ← sum(VSSi)
15: Sdi ← sum(VSdi)
16:$\text{ZMSSD}\leftarrow \left(\right((2\mathrm{Sd}\times \mathrm{Si})S{d}^{2}S{i}^{2})/256)+\mathrm{SSi}+\mathrm{SSd}2\mathrm{Sdi}$
This instance uses 8 pixels at time. SIMD NEON architecture allows computing eight addition or eight multiplication simultaneously. The processing time of the vector implementation decreases to 1.27 μ s.
Computation time results
Table3 summarizes the processing time of the three different implementations of the ZMSSD functional block. The SIMD implementation is approximately 10 times faster than a basic implementation.
Estimation (FB6)
ARM Cortex A8 do not integrate a FPU. That’s why the processing time of the estimation FB is significant (Figure4). To optimize the matrix multiplication, we have used the EIGEN3 library[28] which provides SIMD NEON optimized functions. Figure6 presents the results of the processingtime of the estimation task implemented on the ARM processor (nonoptimized task) and those using the SIMD NEON coprocessor (optimized task). The processing time of the optimized task is approximately eight times faster than those of the nonoptimized one. This gain is due to the lack of the FPU in the Cortex A8 and to the efficiency of the NEON to evaluate a multiply and accumulate instruction in only one CPU cycle.
Parallel implementation on a DSP processor
Digital signal processors (DSP) are usually used in vision systems[29]. They integrate a number of resources that serve to enhance image processing versatility. The use of digital signal processing with data sharing ensures that image processing will be achieved in parallel. With a DSP based image processing, it is possible to parallelize the EKFSLAM algorithm on the multiprocessor architecture (ARM, NEON and DSP processors). This allows enhancing the global processing time especially when we consider to operate in realtime constraints. The landmarks matching (FB3 to FB5) and the robot position estimation (FB6) tasks must be processed sequentially. Fortunately, the initialization tasks (FB7, FB8 and FB9) can run simultaneously with the matching and estimation tasks.
Rethinking the implementation to obtain a parallel implementation, the instance of Algorithm 1 with block partitioning leads to the Algorithm 5.
Algorithm 5 Multiprocessed EKFSLAM
1: Robot pose initialization
2: while localization is required do
3: if DATA = Odometers then
4: Prediction
5: else if DATA = Camera then
6: FAST detector
7: ARM Processor Matching and Estimation (FB 3, 4, 5 and 6)
8: DSP Processor Initialization (FB 7, 8 and 9)
9: end if
10: end while
The architecture of the OMAP3530 can interface the ARM and DSP processors using a shared memory. Figure7 shows the data transfer mechanism using a shared DDR memory area. For each acquired image, the ARM processor writes the image (320×240 pixels), the robot position and its uncertainty on the shared memory. Data transfer between the ARM processor and DSP processor for a 320×240 gray image is done in one millisecond. When the initialization of a landmark is completed, the DSP processor returns the position and the uncertainty of possible new landmarks.
Global results
We have improved the EKF SLAM implementation using the SIMD NEON coprocessor and the DSP processor. We have implemented the matching and estimation tasks on a NEON coprocessor and the initialization tasks on a DSP processor. FAST corner detector is already an optimized algorithm using machine learning[21]. For the latest experiment, we set the same thresholds as Section “Experimental results”.
Table4 summarizes the processing time per iteration and the mean processing time per Frame of each FB. The computing time of the initialization task (blocks 7, 8 and 9) implemented on the DSP processor is approximately 4.0 ms. The DSP processor computes the initialization task while the ARMNEON processors compute the prediction, FAST detection, matching and estimation tasks.
With this implementation and since the processingtime of the initialization task (4.0 ms) is smaller compared to the sum of the processing times of the matching and estimation tasks (13.0 ms for blocks 3, 4, 5 and 6), the overall computing time is reduced to the sum of the processingtimes of the prediction process (0.093 ms), the FAST detector (3.4 ms), the matching and estimation tasks (13.0 ms). The mean processing time per frame with the optimized implementation is 17.6 ms (we add 1 ms for the ARM/DSP data transfer) whereas the nonoptimized implementation has a processing time of 80.85 ms. The optimized processing time represents 22% of the nonoptimized one. The processing time has been reduced by 78%.
Conclusion
This article proposed an efficient implementation of the EKFSLAM algorithm on a multiprocessor architecture. The overall accuracy of the EKFSLAM depends on the number of the landmarks in the state vector and the matched observations. Both are linked to the time allowed to the embedded architecture to compute the robot pose. Based on the application constraints (realtime localization) and an evaluation methodology, we have implemented the algorithm in consideration of the underlying hardware architecture. A runtime analyses shows that the FBs and the initialization task represents 99.6% of the global processing time. We have used an optimized instance of the FAST detector. Two FBs (in matching and estimation tasks) have been optimized on an SIMD NEON architecture. The initialization task has been parallelized on a DSP processor. This optimization required a modification of the algorithm implementation. Using the optimized implementation, the global processing time was reduced by a factor equal to 4.7. The results demonstrate that an embedded systems (with a lowcost multiprocessor architecture) can operate under realtime constraints, if the software implementation is designed carefully. To scale with larger environment, we are going to include an approach of local/global mapping as proposed by[30]. Using this approach, we will be able to map larger environment. The map joining system will be implemented on the GPU coprocessor integrated on the OMAP3530.
Other future developments will be centered around a Hardware–software codesign to improve the system performances implementing a systemonchip with a field programmable gate array (FPGA). The use of a configurable architecture accelerates greatly the design and validation of a proof of realtime and systemonchip concept.
References
 1.
Dissanayake M, Newman P, Clark S, DurrantWhyte H, Csorba M: A solution to the simultaneous localization and map building (SLAM) problem. IEEE Trans. Robot. Autom 2001, 17: 229241. 10.1109/70.938381
 2.
Davison A, Reid I, Molton N, Stasse O: MonoSLAM: realtime single camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell 2007, 29: 10521067.
 3.
Montemerlo M, Thrun S, Koller D, Wegbreit B: FastSLAM: a factored solution to the simultaneous localization and mapping problem. National Conference on Artificial Intelligence, Orlando, Florida, USA, 2002, pp. 593–598
 4.
Folkesson J, Christensen HI: Graphical SLAMa selfcorrecting map. IEEE International Conference on Robotics and Automation, LA, New Orleans, USA, 2004, pp. 383–390
 5.
Eliazar A, Parr R: DPSLAM: fast, robust simultaneous localization and mapping without predetermined landmarks. International Joint Conference on Artificial Intelligence vol. 18. Acapulco, Mexico, 2003, pp. 1135–1142
 6.
Thrun S: Probabilistic robotics. Assoc. Comput. Mach 2002, 45(3):5257.
 7.
Brenneke C, Wulf O, Wagner B: Using 3d laser range data for slam in outdoor environments. IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, Nevada, USA, 2003, pp. 188–193
 8.
Prusak A, Melnychuk O, Roth H, Schiller I: Pose estimation and map building with a timeofflightcamera for robot navigation. Int. J. Intell. Syst. Technol. Appl 2008, 5(3):355364.
 9.
Abrate F, Bona B, Indri M: Experimental EKFbased SLAM for minirovers with IR sensors only. European Conference on Mobile Robots, Freiburg, Germany, 2007
 10.
Yap T, Shelton C: SLAM in large indoor environments with lowcost, noisy, and sparse sonars. IEEE International Conference on Robotics and Automation, Kobe, Japan, 2009, pp. 1395–1401
 11.
Gifford C, Webb R, Bley J, Leung D, Calnon M, Makarewicz J, Banz B, Agah A: Lowcost multirobot exploration and mapping. IEEE International Conference on Technologies for Practical Robot Applications, Woburn, Massachusetts, USA, 2008, pp. 74–79
 12.
Magnenat S, Longchamp V, Bonani M, Rétornaz P, Germano P, Bleuler H, Mondada F: Affordable SLAM through the codesign of hardware and methodology,. IEEE International Conference on Robotics and Automation, Anchorage, Alaska, 2010, pp. 5395–5401
 13.
Montemerlo M, Thrun S, Koller D, Wegbreit B: FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 2003, pp. 1151–1156
 14.
Rezaei S, Guivant J, Nebot E: Carlike robot path following in large unstructured environments. IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, Nevada, USA, 2003, pp. 2468–2473
 15.
Schröter C, Böhme H, Gross H: Memoryefficient gridmaps in RaoBlackwellized particle filters for SLAM using sonar range sensors. European Conference on Mobile Robots, Freiburg, Germany, 2007, pp. 138–143
 16.
Konolige K, Augenbraun J, Donaldson N, Fiebig C, Shah c: A lowcost laser distance sensor. IEEE International Conference on Robotics and Automation, Pasadena, California, USA, 2008, pp. 3002–3008
 17.
Pirjanian P, Karlsson N, Goncalves L, Di Bernardo E: Lowcost visual localization and mapping for consumer robotics. Indust. Robot 2003, 30(2):139144. 10.1108/01439910310464159
 18.
Seignez E, Kieffer M, Lambert A, Walter E, Maurin T: Realtime boundederror state estimation for vehicle tracking. IEEE Int. J. Robot. Res 2009, 28: 3448. 10.1177/0278364908096237
 19.
Siegwart R, Nourbakhsh I: Introduction to Autonomous Mobile Robots. (The MIT Press, London, 2004)
 20.
Vincke B, Elouardi A, Lambert A: Design and evaluation of an embedded system based SLAM applications. IEEE/SICE International Symposium on System Integration, Sendai, Japan, 2010, pp. 224–229
 21.
Rosten E, Porter R, Drummond T: Faster and better: a machine learning approach to corner detection. IEEE Trans. Pattern Anal. Mach. Intell 2009, 32: 105119.
 22.
Davison A: Realtime simultaneous localisation and mapping with a single camera. IEEE International Conference on Computer Vision, Nice, France, 2003, pp. 1403–1410
 23.
Munguia R, Grau AFreiburg, Germany, 2007, pp. 1–6
 24.
Seignez E, Lambert A, Maurin T: An experimental platform for testing localization algorithms. IEEE International Conference On Information And Communication Technologies: From Theory To Application, Damascus, Syria, 2006, pp. 748–753
 25.
Elouardi A, Bouaziz S, Dupret A, Lacassagne L, Klein JO, Reynaud R: A smart architecture for lowlevel image computing. International Journal on Computer Science and Applications, 2008, pp. 1–19
 26.
Auat Cheein F, Carelli R: Analysis of different feature selection criteria based on a covariance convergence perspective for a SLAM algorithm. Sensors 2010, 11: 6289. 10.3390/s110100062
 27.
Kraft M, Schmidt A, Kasinski A: Highspeed image feature detection using FPGA implementation of fast algorithm. International Conference on Computer Vision Theory and Applications, Funchal, Madeira, Portugal, 2008, pp. 174–179
 28.
Eigen, 2012.http://eigen.tuxfamily.org/
 29.
Gunnam K, Hughes D, Junkins J, Kehtarnavaz N: A visionbased DSP embedded navigation sensor. IEEE Sens. J 2002, 2(5):428442. 10.1109/JSEN.2002.806212
 30.
Piniés P, Tardós J: Largescale slam building conditionally independent local maps: application to monocular vision. IEEE Trans. Robot 2008, 24(5):10941106.
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Received
Accepted
Published
DOI
Keywords
 Mobile Robot
 Digital Signal Processor
 Single Instruction Multiple Data
 Estimation Task
 Initialization Task