Real time simultaneous localization and mapping: towards low-cost multiprocessor embedded systems

Simultaneous localization and mapping (SLAM) is widely used by autonomous robots operating in unknown environments. Research community has developed numerous SLAM algorithms in the last 10 years. Several works have presented many algorithms’ optimizations. However, they have not explored a system optimization from the system hardware architecture to the algorithmic development level. New computing technologies (SIMD coprocessors, DSP, multi-cores) can greatly accelerate the system processing but require rethinking the algorithm implementation. This article presents an eﬃcient implementation of the EKF-SLAM algorithm on a multi-processor architecture. The algorithm-architecture adequacy aims to optimize the implementation of the SLAM algorithm on a low-cost and heterogeneous architecture (implementing an ARM processor with SIMD coprocessor and a DSP core). Experiments were conducted with an instrumented platform. Results aim to demonstrate that an optimized implementation of the algorithm, resulting from an optimization methodology, can help to design embedded systems implementing low-cost multiprocessor architecture operating under real-time constraints.


Introduction
Autonomous robots must be able to localize themselves. Simultaneous localization and mapping (SLAM) algorithms aim to build an environment map while estimating the robot pose. Many researches were conducted to develop SLAM algorithms like extended Kalman filter for SLAM (EKF-SLAM) [1,2], FAST SLAM [3], GRAPH SLAM [4], DP-SLAM [5] which aim to improve consistency, accuracy or robustness. Other algorithms derivate from the EKF-SLAM, such as algorithms using unscented Kalman filter (UKF) [6] which increases the localization accuracy against the classical EKF algorithm based on a linearized model. Only few works deal with the implementation of low-cost SLAM embedded systems.
Most of SLAM implementations rely on the use of accurate and dense measurements provided by expensive sensors like laser rangefinder sensors [7] or time of flight cameras [8]. High-priced smart sensors are not suitable to *Correspondence: bastien.vincke@u-psud.fr 1 Univ Paris-Sud, CNRS, Institut d'Electronique Fondamentale, F-91405 Orsay, France Full list of author information is available at the end of the article be integrated in most of embedded systems in commercial objectives or industrial applications.
Simultaneous localization and mapping systems using low-cost sensors have been recently designed. Abrate et al. [9] provide an implementation of the EKF-SLAM algorithm on a Khepera robot. The robot hosts limited range, sparse and noisy IR sensors. Experimental results have shown the importance of the sensor characteristics, the primitives (lines) extraction and data association. Yap and Shelton [10] use cheap, noisy and sparse sonar sensors embedded in a P3-DX robot. To cope with these low-cost sensors, the implemented SLAM algorithm uses a multiscan approach and an orthogonality assumption to map indoor environments.
Classical SLAM algorithms are too computationally intensive to run on an embedded computing unit. They require at least laptop-level performances. Gifford et al. [11] present a low-cost approach to autonomous multirobot mapping and exploration for unstructured environments. The robot hosts a Gumstix computing unit (600 Mhz), 6 IR scanning range arrays, a 3-axis gyroscope and odometers. Running DP-SLAM alone on the Gumstix with 15 particles takes on average 3 s per update. While http://jes.eurasipjournals.com/content/2012/1/5 using 25 particles, it takes more than 10 s per update. Authors have underlined the difficulty to find the right SLAM parameters to fit within the available computing power and the real-time processing. Magnenat et al. [12] present a system based on the co-design of a low-cost sensor (a slim rotating scanner), a SLAM algorithm, a computing unit, and an optimization methodology. The computing unit is based on an ARM processor (533 Mhz) running a FASTSLAM 2.0 algorithm [13]. Magnenat et al. [12] use an evolution strategy to find the best configuration of the algorithm and setting of the parameters.
As pointed out by [11,12], the first improvement of a SLAM algorithm is an efficient setting of the various parameters of the algorithm. Other modifications were investigated to reach real-time constraints. These modifications are necessary due to the low computing power and limited memory resources available on embedded systems. Features restriction for EKF-SLAM algorithm has been implemented to decrease the processing time [14]. Schroter et al. [15] focused on reducing the memory footprint of particle-based gridmap SLAM by sharing the map between several particles.
Robust laser-based SLAM navigation has long existed in robot applications, but systems implement sensors that, in some cases, are more expensive than the final product. Neato Robotics has developed a vacuum cleaner that implements a navigation system using a SLAM algorithm. The approach is based on a low-cost system implementing a designed laser rangefinder [16].
This article presents an efficient implementation of the EKF-SLAM algorithm on a multi-processor architecture. The approach is based on an algorithm implementation adequate to a defined architecture. The aim is to optimize the implementation of the SLAM algorithm on a low-cost and heterogeneous architecture implementing an SIMD coprocessor (NEON) and a DSP core. The hardware includes several low-cost sensors. As [17], we chose to use a low-cost camera (exteroceptive sensor) and odometers (proprioceptive sensors). Following [12], we efficiently tune the parameters of the SLAM algorithm. We improve on previous works by proposing an adequate implementation of the EKF-SLAM algorithm on a multiprocessing architecture (ARM processor, SIMD NEON coprocessor, DSP core). The specifications related to the NEON coprocessor and the DSP core improve the processing time and the system performance. Results aim to demonstrate that an optimized implementation of the algorithm, resulting from an evaluation methodology, can help to design embedded systems implementing low-cost multiprocessor architecture operating under real-time constraints.
Section "EKF-SLAM algorithm" introduces the EKF-SLAM algorithm. Section "Multiprocessor architecture and system configuration" presents the embedded multiprocessor architecture and the system configuration.
Section "Evaluation methodology and algorithm implementation" details the evaluation methodology, provides a first algorithm implementation and analyzes this implementation in terms of processing time. A Hardwaresoftware optimization is proposed and analyzed in Section "Hardware-software optimization and improvements". It presents SIMD optimizations and DSP parallelization. A performance comparison is then performed between the optimized and non-optimized instances. Finally, Section "Conclusion" concludes this article.

EKF-SLAM algorithm
Overview Extended Kalman filter for SLAM estimates a state vector containing both the robot pose and the landmark locations. We consider that the robot is moving on a plane. The algorithm uses 3D points as landmarks. It uses proprioceptive sensors to compute a predicted vector and then corrects this state using exteroceptive sensors. In this article, we consider a wheeled robot embedding two odometers (attached to each rear wheel) and a camera.

State vector and covariance matrix
With N landmarks, the state vector is defined as: where: • x,z are the ground coordinates (x-axis, z-axis) of the robot rear axle center. We suppose that the robot is always moving on the ground, so y = 0 (no elevation) and y does not appear in Equation (1). • θ is the orientation of a local frame attached to the robot with respect to the global frame. • x a 1 , y a 1 , z a 1 , . . . , x a N , y a N , z a N are the 3D coordinates of the N landmarks in the global frame.
The state covariance matrix is defined as:

Prediction
The prediction step relies on the measurements of the proprioceptive sensors, the odometers, embedded on our experimental platform. A non linear discrete-time statespace model is considered to describe the evolution of the robot configuration x: where u k is a known two-dimensional control vector, assumed constant between the times indexed by k − 1 and k, and v k is an unknown state perturbation vector that accounts for the model uncertainties. x k−1|k−1 represents the state vector at time k-1, x k|k−1 represented the state vector after the prediction step, x k|k represents the state vector after the estimation step. The classical evolution model, described in [18], is considered: where u k = (δs, δθ); δs is the longitudinal motion and δθ is the rotational motion [19]: where: • w r and w l are respectively the radius of the right and left wheel. • e is the length of the rear axle.
• δϕ i = δp i 2π ρ with i ∈ {r, l} (r=right, l =left), δp i : number of steps, ρ: odometer resolution. δϕ i is the angular movement of the right/left wheel.
The state covariance matrix is defined as: where • Q k is the covariance matrix of the process noise.

Estimation
The estimation of the state is made using the camera which returns the position in the image (u i , v i ) of the i-th landmark.
The innovation and its covariance matrix: The pinhole model is used to project a known landmark position into the image: ⎛ where: is the position of the i -th landmark in the camera frame.
• f is the focal length.
• (k u , k v ) is the number of pixels per unit length.
• s uv is a factor accounting for the skew due to non-rectangular pixels. In our case, we take s uv =0.
Equation (7) can be written as the predicted observation equation for a single landmark: The pose of a landmark in the camera frame is defined from its pose (x a i ,y a i ,z a i ) in the global frame: Where D is the length between the camera and the robot rear axle center.
During the observation step, the algorithm matches M landmarks (M <= N) whose observations are added in Thus, the innovation is: whereẑ k is the measurement for all the M predicted observations. The innovation covariance S k is: where H k is the Jacobian of h k and R k is the observation noise covariance. http://jes.eurasipjournals.com/content/2012/1/5 State estimation: The state is updated using the classical EKF equations:

Visual landmarks
The landmarks used in the observation equation are extracted from images. Landmark initialization defines the initial coordinates and the initial covariance of landmarks localization (also called interest points or features).
In [20], we have evaluated the processing time of corner detectors like Harris, Shi-Tomasi or FAST. Harris and Shi-Tomasi detectors were more time consuming than the FAST detector and do not provide significantly better localization results than FAST. Consequently, there is no need to implement more sophisticated algorithms such as Harris or Shi and Tomasi. FAST [21] (Features from Accelerated Segment Test) corner detector relies on a simple test performed for a pixel p by examining a circle of 16 pixels (a Bresenham circle of radius 3) centered on p. A feature is detected at p if the intensities of at least 12 contiguous pixels are all above or all below the intensity of p with a threshold t. Even if this detector is not highly robust to noises and depends on a threshold it produces stable landmarks and is computationally very efficient [21]. The FAST detector [21] is related to the wedge-model style of detector evaluated using a circle surrounding a candidate pixel. To optimize the detector processing-time, this model is used to made a decision classifier which is applied to the image ( Figure 1).

Matching based on zero-mean sum of squared differences
The EKF-SLAM matches the previously detected feature with a new one using zero-mean sum of squared differences (ZMSSD).
The covariance of the projected feature localization defines a searching area τ . This area includes the robot localization uncertainty and the landmarks localization uncertainty. We use the ZMSSD to find the best candidate point inside τ . For each candidate point p : (p x , p y ), the N p value of the weighted ZMSSD is: and where: • w(p x , p y ) is the Gaussian weights defined by the landmark covariance. The observation p obs will be selected using The descriptor, used to identify the landmark during the matching, is classically a small image window of 9×9 pixels to 16×16 pixels around the interest point. Davison [22] claims that this sort of descriptor is able to serve as long-term landmark feature.

Landmark initialization based on davison method
Landmark initialization consists of defining the initial coordinates and the initial covariance of landmarks (interest points). Various methods exist and can be classified as an undelayed or delayed method. Undelayed method adds landmarks with only one measurement whereas the delayed method needs two or more frames. We chose to use the widely spread delayed method proposed by [2] which is both efficient and adequate to implement. Furthermore the work of Munguia and Grau [23] shows that the delayed method have the same performance as the undelayed method.
In order to compute the 3D depth of a newly detected landmark, as [2], we initialize a 3D line into the map along which the landmark must lie. This line starts at the estimated camera position and heads to infinity along the feature viewing direction. The line is composed of 100 particles which represent depth hypothesis. The prior probability used is uniform and the range is 0.5 to 15 m. At subsequent time, each particle (a feature depth hypothesis) is projected into the image, matched and its probability is re-weighted.
When the ratio of the standard deviation of depth to the expected value is below a threshold, the distribution is approximated as a Gaussian and the landmark is initialized. The landmark pose A i = (x a i , y a i , z a i ) is added to x and the A i covariance is added into P.

Multiprocessor architecture and system configuration
In order to test and validate the EKF-SLAM algorithm, experiments were conducted with an instrumented mobile robot called Minitruck [24]. The platform was teleoperated during the experiments. For our first evaluation, the experiment consists to operate inside a large corridor of our research lab (see Figure 2).
We have developed a system architecture on the top of a multi-processor board (Gumstix Overo) based on the OMAP3530 chip (see Figure 3). The OMAP chip integrate a RISC processors (ARM Cortex A8 500 Mhz) with an SIMD NEON coprocessor, a DSP (TMS320C64x + 430 Mhz) and a graphical processor unit (POWERVR SGX). This board communicates with an additional processor for control and data acquisition (Atmega168 16 Mhz).
Multiple sensors (odometers and a camera) are interfaced to this architecture ( Figure 2). The variety of sensors enables us to evaluate the SLAM algorithms with different types of sensor data and take advantage of the information complementary of these sensors. Our objective is to  evaluate the implementation of SLAM algorithms using land vehicles and sensors, like steering encoders and a camera.
The use of wheel and steer encoders is obvious in robotics and navigation. Simple kinematic motion models can be used to integrate velocity and heading measurements from wheel and steer encoders to provide an estimation of the mobile robot location and orientation. Estimations are regularly subject to considerable errors due to misalignment, offsets and wheels slippage. It is possible to implement basic models to approximate and correct offset and slippage errors on-line leading to significant improvement of performances. We chose two HEDS 5540 odometers for our experimental vehicle.
The feature detection in SLAM application relies on the embedded sensors. We chose to achieve this extraction using a vision sensor (a cheap USB webcam, Philips SPC530NC, delivering 30 fps). We chose to use all possible images (30 fps) because it is much easier to perform point matching if the movement is small. Conventional approaches for vision systems design are usually based on general purpose computers interfaced with cameras. The new computing technologies (SIMD, DSP, multi-cores) can greatly accelerate algorithm processing, but require rethinking these algorithms by optimizing the parallelism. This parallel processing is pushed to integrate near the sensors parallel computing units [25]. We have used a Gumstix processing module based on OMAP3530 architecture. It is an heterogeneous architecture (ARM Cortex-A8 500 Mhz processor with power consumption less than 300 mW, SIMD NEON integrated coprocessor, DSP C64x processor and a 3D graphics accelerator) that communicates via a WLAN connection (802.11 g).
The WLAN connection is used only to control speed and direction of the vehicle. In the future, a dedicated algorithm to autonomous navigation will be implemented http://jes.eurasipjournals.com/content/2012/1/5 and thus the WLAN connection will be used to achieve only the system monitoring. A coprocessor (ATMega168) takes care of data acquisition. It controls the robot speed and its direction using two pulse-width modulation (PWM) signals. It decodes signals coming from the odometers embedded in the rear wheels. It communicates with the main board using an I2C interface. This interface allows the main processor to retrieve odometers data and to send instructions corresponding to speed and direction.
To evaluate the designed system, an experiment was achieved in a corridor of our lab. Frames have been grabbed at 30 fps with 320×240 resolution. Odometer data were sampled at 30 Hz. During the experiment, references are periodically drawn on the ground by an embedded marker.

Evaluation methodology and algorithm implementation
Our evaluation methodology is based on the identification of the processing tasks requiring a significant computing time. It is based on several steps: we analyze first the execution time of tasks and their dependencies on the algorithm's parameters. A threshold is fixed for each parameter. The algorithm is then partitioned in order to have functional blocks (FBs) performing defined calculations. Each block is then evaluated to determine its processing time. Function blocks that require the most important execution time are then optimized to reduce the global processing time.

Prediction process
This phase updates the mobile robot position (x k|k−1 ) according to its proprioceptive data acquired from odometers (ϕ l , ϕ r ). The processing time of the prediction process is constant. It just updates the 3D vector containing the robot pose and its 3×3 covariance matrix. During the prediction step, the landmarks localization and uncertainties do not change: landmarks are defined in the global frame.

Correction process
The processing time of the correction process is not constant. The following of this section studies the processing time of each task of the process and their dependencies.
Matching task Each landmark in the state vector must be projected in the camera frame using the pinhole model (see L. 2). The computing time of these projections depends only on the number of landmarks in the state vector (L. 2). For each projected landmark on the focal plane, ZMSSD matches an observation. Both the size of the descriptor and the size of the searching area τ will affect the computing time (see Equation (13)). http://jes.eurasipjournals.com/content/2012/1/5 The processing time of the matching task depends on several parameters: • The number of landmarks in the state vector.
• The number of visible landmarks on the focal plane.
• The size of the descriptor.
• Both the localization uncertainty of the mobile robot and the landmarks.
In practice, all the previously defined parameters should be set in order to bound the computing time. The first three parameters can be set by the users. The uncertainty depends on the followed path and cannot be bounded.

Estimation task
The estimation task uses the classical Kalman equations to update both the robot and landmarks uncertainties. The processing time of the estimation task is time-consuming and depends on: • The number of landmarks in the state vector.
• The number of landmarks observed.
The size of the matrix and thus the computing cost of the matrix multiplication in the Equations (11) and (12) depend on the number of landmarks in the state vector. Moreover, Equation (11) depends on the number of landmarks observed. As for the matching process, these parameters (size of the state vector and number of observations) should be bounded in order to achieve this estimation task in a constant computing time.
Initialization task For each landmark under initialization, each particle (a feature depth hypothesis) is projected into the image, matched and its probability is re-weighted. If there is a lack of landmarks under initialization, we add aspiring new landmarks. The processing time of the initialization task depends on: • The number of landmarks being initialized.
• The size of the descriptor.
• Both the localization uncertainty of the mobile robot and the landmark.
The number of landmarks being initialized and the size of the descriptors can be bounded. For each landmark being initialized, we have to update the probability of each localization hypothesis using a matching process. As for the matching task, the computing time depends on the localization uncertainty of the mobile robot and the landmarks.

Thresholds definition
Previous section shows that the computation time of each task of the EKF-SLAM algorithm depends on many variables. For real-time implementation, it is important to get a constant, or at least a bounded computation time. To solve this constraint we have to: • set the maximum number of landmarks in the state vector. The size of the state vector will be fixed. Therefore, no dynamic memory allocation will be needed. • set the maximum number of landmarks observed.
This keeps the computation time of the estimation task constant using a fixed size matrix multiplication. • set the maximum number of landmarks being initialized in order to bound the computation time of the initialization task. Unfortunately, it will not be sufficient to keep the computation time of the initialization task constant due to its internal matching step. • bound the computing time induced by the uncertainties. The only solution to get a bounded global-processing-time is to set a maximum execution time for the matching task. Due to the constant processing time of the prediction and the estimation task, the execution time of both the matching and initialization task can be bounded (33 ms -(t prediction + t estimation )). We chose to use all possible images (30 fps).We set a maximum execution time for the matching task. The algorithm proceeds in a way to match a maximum of landmarks in a bounded time. The initialization task has a dynamic execution time depending on the real processing time of the matching task and the number of landmarks being initialized. The lower bound of this dynamic execution allows at least a minimum number of landmarks to be initialized.

Map management
To keep the size of the state vector constant, we need to delete some landmarks when inserting new ones. The new state vector includes new landmarks (whose initialization has just been performed) and previously used landmarks. Auat Cheein and Carelli [26] proposes an efficient method to select landmarks for the estimation task. It is based on the evaluation of the influence of a given feature on the convergence of the state covariance matrix. The method matches all possible landmarks and computes (I − K k H k ) from Equation (12). Unfortunately, we cannot implement it exactly as proposed by [26] due to the high computing time. We chose to add the landmarks, based on the previous estimation step, by selecting the previous landmarks which have the best previous influence on the convergence of the state covariance matrix. At time k, we select the landmark which had the smallest (I − K k−1 H k−1 ). http://jes.eurasipjournals.com/content/2012/1/5  (3,4,5,6,7,8 and 9). From the previous algorithm, we have defined 9 FBs and their runtimes are studied in below Table 1. Each FB has a fixed computing time and some FB can occur more than one time (Landmark projection, ZMSSD, H i , Weight updating, Addition of a new landmark).

Processing time evaluation
As an application scenario, the robot moves over a square of 6 m side. At the end of the trajectory, it joined the initial starting position. Using only odometers, the final localization has an error of 1.6 m. With the EKF-SLAM algorithm, the localization has been significantly improved. The final error is approximately 0.4 m. EKF-SLAM includes all viewed landmarks in the state vector. Indeed, the localization result depends on the number of landmarks but the size of the state vector and the number of observations must be bounded to achieve a bounded computing time. The overall accuracy of the EKF-SLAM depends on the number of the landmarks in the state vector and the matched observations. The accuracy of the localization depends monotonically on the number of processed landmarks.
The given EKF-SLAM (Algorithm 1) is processed sequentially on the embedded ARM processor operating at 500 MHz (no coprocessor is implemented). In the following, all times given correspond to times evaluated on the embedded system using the ARM processor. The data acquisition time is constant: • The odometer data acquisition is achieved in 0.7 ms (this processing time is due to the I2C communication with the Atmega168 processor). • Each image acquisition takes 1.8 ms (due to USB data transfer).
The prediction step does not require significant processing time, it takes only 0.093 ms per iteration. As for the matching task, the estimation task cannot be achieved in a constant processing time. Estimation task processing time depends on the total number of landmarks and the number of matched landmarks. Figure 4 shows the processing time of the estimation task according to the number of landmarks in the state vector. The estimation task is entirely processed on the ARM processor (no use of coprocessor). Obviously it will be impossible to take into account all the landmarks detected when the algorithm is processed: the computation time will be Figure 4 Processing time of the estimation task. http://jes.eurasipjournals.com/content/2012/1/5 higher than the 33 ms allowed. It is necessary to find a compromise between the number of landmarks and the processing time.

Experimental results
An experiment was conducted to evaluate the processing time of the different blocks of the algorithm (including tasks with unboundable processing time). For this experiment, we set the size of the descriptor to 16 × 16 pixels and we set the thresholds as follows: • Maximum number of landmarks in the state vector: 25. First, we can analyze the runtime of the 8 previously defined FBs of the algorithm. We have used the integrated cycle counter register (CCNT) of the ARM processor to compute the processing time of each FB. The prediction process (FB1) occurs in 0.093 ms. Table 2 summarizes, for the other FBs, the processing time per iteration, the mean of the number of iterations and the mean of the processing time per correction process. The estimation task could not be processed in some iterations of the correction process, especially when there is no matched landmark. The mean processing time by frame is approximately 80.8 ms which corresponds to the sum of all processing times: prediction process (FB1) and correction process (FB2 to FB9). The processing time of the estimation task (FB6) is approximately 70.5 ms and it represents about 87% of the global processing time. The FAST detector (FB2) represents 3.4 ms. The ZMSSD-M task (FB4) takes 2.63 ms per correction process. Finally, the initialization task (FB7, FB8 and FB9) takes 3.9 ms. These six FBs represent 99.6% of the global processing time. We focused on an efficient implementation of these FBs to enhance the global processing time.

OMAP3530 architecture description
The OMAP3530 is an heterogeneous architecture designed by TI (Texas Instruments) and implements an ARM Cortex-A8 500 MHz processor, a NEON coprocessor with SIMD instructions, a DSP C64x processor and a 3D graphics accelerator.
The NEON unit is similar to the MMX and SSE extensions existing on an X86 processor. It is optimized for Single Instruction Multiple Data (SIMD) operations. The NEON unit has two floating point pipelines, an integer pipeline and a 128 bits load/store/permute pipeline. An efficient implementation on the SIMD NEON architecture improves the processing time. NEON instructions perform "Packed SIMD" processing as follows: • Registers are considered as vectors of the same data type elements • Data types can be: signed/unsigned 8, 16, 32, 64-bits or single precision floating point • Instructions perform the same operation on multiple data simultaneously as shown in Figure 5. The number of simultaneous operations depends on the data type: NEON supports up to 16 operations at the same time using 8-bits data.

SIMD optimization results
In the Algorithm 1, the time-consuming FBs are: the estimation block (FB6), the initialization blocks (FB7, FB8 and FB9), the FAST detector block (FB2) and the ZMSSD-M block (FB4). FAST detector is already an optimized instance using machine learning [21]. Moreover, FAST has been already implemented on an FPGA based architecture [27]. We chose to optimize the other FBs. The matching task computes ZMSSD which computes the image correlation. It performs the same operation (addition, subtraction, multiplication and comparison) on  8 bits data. The computation of the ZMSSD can be optimized using the SIMD NEON architecture. The estimation task is based on floating point matrix multiplication, it could efficiently be optimized using the SIMD NEON coprocessor (the ARM Cortex A8 does not include any floating point unit (FPU)). The initialization FBs will be studied at Section "Parallel implementation on a DSP processor".

ZMSSD (FB4)
The EKF-SLAM matches features using ZMSSD. ZMSSD is computed for each landmark using Equation 2. We chose to use a descriptor with 16×16 8-bits pixels size due to the efficiency of SIMD NEON architecture to deal with 128/64 bits vectors.

Basic implementation
The basic implementation of the ZMSSD function block computes the means of the pixel values in a window m i (m d can be precalculated when the landmark is detected). Then the ZMSSD (ZMSSD) is computed using loops (Algorithm 2).

Algorithm 2 Basic ZMSSD
This implementation takes 12.60 μs on the ARM processor.

Efficient scalar implementation
The second implementation aims to modify the calculation of ZMSSD in order to avoid the use of two loops. Formally the ZMSSD is written as: where d = d(i, j) and im = im(p x + i − des 2 , p y + j − des 2 ) By expanding the ZMSSD, we obtain: The equation becomes: 22) http://jes.eurasipjournals.com/content/2012/1/5 Using the notation: • Sd = i,j d(i, j) the sum of the descriptor pixels (this sum can be precalculated).
2 ) the sum of squared image pixel values. • SSd = i,j d(i, j)×d(i, j) the sum of the squared descriptor pixel values (this sum can be precalculated).
the sum of the product of the descriptor pixels and the image pixels.
The final equation is: The implementation of Algorithm 2 becomes Algorithm 3:

Algorithm 3 Efficient scalar ZMSSD
In this instance, we use only one loop. This reduces memory access. Using this implementation, the computing time decrease from 12.60μs to 11.29μs.
Vector implementation SIMD NEON architecture allows vector processing and performs the same operation on all the vector processing-units. We have implemented a vectorized instance of the ZMSSD functional block as follows (Algorithm 4):

Algorithm 4 SIMD vectorized ZMSSD
This instance uses 8 pixels at time. SIMD NEON architecture allows computing eight addition or eight multiplication simultaneously. The processing time of the vector implementation decreases to 1.27 μs. Table 3 summarizes the processing time of the three different implementations of the ZMSSD functional block. The SIMD implementation is approximately 10 times faster than a basic implementation.

Estimation (FB6)
ARM Cortex A8 do not integrate a FPU. That's why the processing time of the estimation FB is significant (Figure 4). To optimize the matrix multiplication, we have used the EIGEN3 library [28] which provides SIMD NEON optimized functions. Figure 6 presents the results of the processing-time of the estimation task implemented on the ARM processor (non-optimized task) and those using the SIMD NEON coprocessor (optimized task). The processing time of the optimized task is approximately eight times faster than those of the non-optimized one. This gain is due to the lack of the FPU in the Cortex A8 and to the efficiency of the NEON to evaluate a multiply and accumulate instruction in only one CPU cycle.

Parallel implementation on a DSP processor
Digital signal processors (DSP) are usually used in vision systems [29]. They integrate a number of resources that serve to enhance image processing versatility. The use of digital signal processing with data sharing ensures that image processing will be achieved in parallel. With a DSP based image processing, it is possible to parallelize the  Figure 6 Processing time of the estimation task on the ARM and NEON coprocessor.
EKF-SLAM algorithm on the multiprocessor architecture (ARM, NEON and DSP processors). This allows enhancing the global processing time especially when we consider to operate in real-time constraints. The landmarks matching (FB3 to FB5) and the robot position estimation (FB6) tasks must be processed sequentially. Fortunately, the initialization tasks (FB7, FB8 and FB9) can run simultaneously with the matching and estimation tasks.
Rethinking the implementation to obtain a parallel implementation, the instance of Algorithm 1 with block partitioning leads to the Algorithm 5. The architecture of the OMAP3530 can interface the ARM and DSP processors using a shared memory. Figure 7 shows the data transfer mechanism using a shared DDR memory area. For each acquired image, the ARM processor writes the image (320×240 pixels), the robot position and its uncertainty on the shared memory. Data transfer between the ARM processor and DSP processor for a 320×240 gray image is done in one millisecond. When the initialization of a landmark is completed, the DSP processor returns the position and the uncertainty of possible new landmarks.

Global results
We have improved the EKF SLAM implementation using the SIMD NEON coprocessor and the DSP processor. We have implemented the matching and estimation tasks on a NEON coprocessor and the initialization tasks on a DSP processor. FAST corner detector is already an optimized algorithm using machine learning [21]. For the latest experiment, we set the same thresholds as Section "Experimental results". Table 4 summarizes the processing time per iteration and the mean processing time per Frame of each FB. The computing time of the initialization task (blocks 7, 8 and 9) implemented on the DSP processor is approximately 4.0 ms. The DSP processor computes the initialization task while the ARM-NEON processors compute the prediction, FAST detection, matching and estimation tasks.
With this implementation and since the processing-time of the initialization task (4.0 ms) is smaller compared to the sum of the processing times of the matching and estimation tasks (13.0 ms for blocks 3, 4, 5 and 6), the overall computing time is reduced to the sum of the processing-times of the prediction process (0.093 ms),  the FAST detector (3.4 ms), the matching and estimation tasks (13.0 ms). The mean processing time per frame with the optimized implementation is 17.6 ms (we add 1 ms for the ARM/DSP data transfer) whereas the nonoptimized implementation has a processing time of 80.85 ms. The optimized processing time represents 22% of the nonoptimized one. The processing time has been reduced by 78%.

Conclusion
This article proposed an efficient implementation of the EKF-SLAM algorithm on a multiprocessor architecture. The overall accuracy of the EKF-SLAM depends on the number of the landmarks in the state vector and the matched observations. Both are linked to the time allowed to the embedded architecture to compute the robot pose. Based on the application constraints (realtime localization) and an evaluation methodology, we have implemented the algorithm in consideration of the underlying hardware architecture. A runtime analyses shows that the FBs and the initialization task represents 99.6% of the global processing time. We have used an optimized instance of the FAST detector. Two FBs (in matching and estimation tasks) have been optimized on an SIMD NEON architecture. The initialization task has been parallelized on a DSP processor. This optimization required a modification of the algorithm implementation. Using the optimized implementation, the global processing time was reduced by a factor equal to 4.7. The results demonstrate that an embedded systems (with a low-cost multiprocessor architecture) can operate under real-time constraints, if the software implementation is designed carefully. To scale with larger environment, we are going to include an approach of local/global mapping as proposed by [30]. Using this approach, we will be able to map larger environment. The map joining system will be implemented on the GPU coprocessor integrated on the OMAP3530.
Other future developments will be centered around a Hardware-software co-design to improve the system performances implementing a system-on-chip with a field programmable gate array (FPGA). The use of a configurable architecture accelerates greatly the design and validation of a proof of real-time and system-on-chip concept.