 Research
 Open Access
Efficient embedded architectures for fast-charge model predictive controller for battery cell management in electric vehicles
 Anne K. Madsen^{1} and
 Darshika G. Perera^{1}
https://doi.org/10.1186/s13639-018-0084-3
© The Author(s). 2018
 Received: 27 December 2017
 Accepted: 21 May 2018
 Published: 16 July 2018
Abstract
With the ever-growing concerns about carbon emissions and air pollution throughout the world, electric vehicles (EVs) are one of the most viable options for clean transportation. EVs are typically powered by a battery pack, such as lithium-ion, which is created from a large number of individual cells. In order to enhance the durability and prolong the useful life of the battery pack, it is imperative to monitor and control the battery pack at the cell level. The model predictive controller (MPC) is considered a feasible technique for cell-level monitoring and controlling of battery packs. For instance, the fast-charge MPC algorithm keeps the Li-ion battery cell within its optimal operating parameters while reducing the charging time. In this case, the fast-charge MPC algorithm should be executed on an embedded platform mounted on an individual cell; however, the existing algorithm for this technique is designed for general-purpose computing. In this research work, we introduce novel, unique, and efficient embedded hardware and software architectures for the fast-charge MPC algorithm, considering the constraints and requirements associated with embedded devices. We create two unique hardware versions: register-based and memory-based. Experiments are performed to evaluate and illustrate the feasibility and efficiency of our proposed embedded architectures. Our embedded architectures are generic, parameterized, and scalable. Our hardware designs achieved a 100-fold speedup compared to their software counterparts.
Keywords
 Embedded architectures
 Model predictive control
 FPGAs
 Hardware accelerators
 Electric vehicles
 Battery cell management
1 Introduction
The adoption of alternative-fuel vehicles is considered one of the major steps towards addressing the issues related to oil dependence, air pollution, and, most importantly, climate change. Among many options, electricity and hydrogen fuel cells are the top contenders for alternative vehicle fuels. Despite numerous initiatives, both from governments and the private sector around the world, to enhance the usage of electric vehicles (EVs), we continue to face many challenges in promoting the wider acceptance of EVs by the general public. Some of these major challenges include the charging time of the battery and the maximum driving distance of the vehicle [1]. In recent years, major EV manufacturers such as Tesla have made numerous strides in the electric vehicle industry; however, the distance-traveled, high-cost, and charging-time constraints must still be overcome to gain market acceptance.
Electric vehicles (EVs) are often powered by energy storage systems such as battery packs, fuel cells, capacitors, supercapacitors, and combinations of the above. Among these energy storage systems, lithium-ion (Li-ion) battery packs are widely employed in EVs, mainly because of their light weight, long life, and high energy density [2]. In this case, the battery packs are typically created from individual Li-ion cells arranged as series and/or parallel modules. The long-term performance (durability) of the Li-ion battery pack is significantly affected by the choice of charging strategy. For instance, exceeding the current and voltage constraints of a Li-ion battery cell can cause irreversible damage and capacity loss that would degrade the long-term performance and curtail the effective life of the battery pack [3]. Conversely, operating within the current and voltage constraints would enhance the durability and prolong the useful life of the battery pack. This requires monitoring and controlling the battery packs at the cell level. However, most of the existing research on battery management systems (BMSs) focuses on system-level or pack-level control and monitoring, as in [2], instead of the cell level. Thus, it is crucial to investigate and provide efficient techniques and design methodologies to monitor and control the battery packs at the cell level and to optimize the parameters of the individual cells, in order to enhance the durability and useful life of the battery packs.
The model predictive controller (MPC) has been investigated as a viable technique for cell-level monitoring and controlling of battery packs [3]. MPC is a popular control technique that enables incorporating constraints and generating predictions, while allowing systems to operate at the thresholds of those constraints. For some time, the MPC algorithm has been utilized in industrial processes, typically in non-resource-constrained environments; in recent years, however, this algorithm has been gaining interest in resource-constrained environments, including cyber-physical systems and hybrid automotive fuel cells [3], to name a few. The effectiveness of the MPC algorithm for cell-level monitoring/control depends on the accuracy of the mathematical model of the battery cell. These mathematical models include equivalent circuit models (ECMs) and physics-based models. Of these, ECM models are more popular due to their simplicity. In [3], the authors prove the efficacy of controlling and providing a fast-charge mechanism for Li-ion battery cells by integrating the MPC algorithm with an ECM model. This fast-charge MPC mechanism incorporates various constraints, such as maximum current, current delta, cell voltage, and cell state of charge, which keep the Li-ion battery cell within its optimal operating parameters while reducing the charging time. Thus far, this fast-charge MPC algorithm has been designed and developed in Matlab and executed on a desktop computer [3]. However, in a real-world scenario, it is imperative to execute this fast-charge MPC algorithm on an embedded platform mounted on an individual cell, in order to utilize this algorithm to monitor and control the individual cells in a battery pack.
Since the existing algorithm for the fast-charge MPC is designed for general-purpose computers such as desktops [3, 4], it cannot be executed directly on embedded platforms in its current form. Furthermore, embedded devices have many constraints, including stringent area and power limitations, lower cost and time-to-market requirements, and high-speed performance requirements. Hence, it is crucial to modify the existing algorithm significantly in order to satisfy the requirements and constraints associated with embedded devices.
Although MPC is becoming popular, the measure-predict-optimize-apply cycle [5] of the MPC algorithm is compute-intensive and requires a significant amount of resources, including processing power and memory (to store data and results). In this case, the smaller the control and sampling interval (or time), the larger the resource cost. This resource cost also impacts the feasibility and efficiency of designing and developing MPC algorithms on embedded platforms.
We investigated the existing research on MPC algorithms, as well as the existing research on embedded systems designs for MPC algorithms, in the literature. Most of the research on discrete linearized state-space MPC focused on reducing the complexity of the quadratic programming (QP), increasing the speed of the QP computation, or both. The existing works on online MPC methods include the fast gradient method [6, 7], active set methods [8–10], interior point methods [11–16], Newton’s method [9, 17, 18], Hildreth’s QP [19], and others [20]. In [21], a faster online MPC was achieved by combining several techniques, such as explicit MPC, the primal barrier interior point method, warm start, and Newton’s method. In [9, 18], a logarithmic number system (LNS)-based MPC was designed on a field-programmable gate array (FPGA) to produce integer-like simplicity. The existing research on embedded systems designs for the MPC algorithm focused on FPGAs [8, 11, 12, 17, 22, 23], systems-on-chip [9, 16], programmable logic controllers (PLCs) [24], and embedded microprocessors [25]. Although there were interesting MPC algorithms/designs among the existing works, none of them were suitable for monitoring and controlling individual cells of the battery pack. For instance, the above existing MPC algorithms/designs did not include the feedthrough term required by the battery cell model introduced with the fast-charge MPC algorithm in [3]. The impact of the feedthrough term is discussed in detail in Section 2.

The main contributions of this research work are as follows:

- We introduce unique, novel, and efficient embedded architectures (both hardware and software) for the fast-charge MPC algorithm. Our architectures are generic, parameterized, and scalable; hence, without changing the internal architectures, our designs can be used for any control-systems application that employs similar MPC algorithms with varying parameters and can be executed on different platforms.

- Our proposed architectures can also be utilized to control the charging of multiple battery cells individually, in a time-multiplexed fashion, thus significantly reducing the hardware resources required for the BMS.

- We propose two different hardware versions (HW_v1 and HW_v2). With the register-based HW_v1, a customized, parallel-processing architecture is introduced to perform the matrix computations in parallel, mostly utilizing registers to store the data/results. With the Block Random Access Memory (BRAM)-based HW_v2, an optimized architecture is introduced to address certain issues that arose with HW_v1, employing BRAMs to store the data/results. These two hardware versions can be used in different scenarios, depending on the requirements of the application.

- With both hardware versions, we introduce novel and unique submodules, including multiply-and-accumulate (MAC) modules that are capable of processing matrices of varying sizes and of distinguishing and handling sparse versus dense matrices, to reduce the execution time. These submodules further enhance the speedup and area efficiency of the overall fast-charge MPC algorithm.

- Considering the existing works on embedded designs for MPC, our architectures are the only designs (in the published literature) that support a nonzero feedthrough term for instantaneous feedback. We perform experiments to evaluate the feasibility and efficiency of our embedded designs and to analyze the associated trade-offs, including speed versus space. Experimental results are obtained in real time while the designs are actually running on the FPGA.
This paper is organized as follows: In Section 2, we discuss and present the background of MPC, including the main stages of the fast-charge MPC algorithm. Our design approach and development platform are presented in Section 3. In Section 4, we detail the internal architectures of our proposed embedded software design and our proposed register-based and memory-based embedded hardware designs. Our experimental results and analysis are reported in Section 5. In Section 6, we summarize our work and discuss future directions.
2 Background: model predictive controller
The model predictive controller (MPC) utilizes a model of a system (under control) to predict the system’s response to a control signal. Using the predicted response, the control signals are adjusted until the target response is achieved, and then the control signals are applied. For instance, in autonomous vehicles, such a model can be used to predict the path of the vehicle. If the predicted path does not match the reference or target path, adjustments are made to the control signals until the two paths are within an acceptable range.
Our investigation of the existing MPC algorithms revealed that the MPC design in [3] provides a simple, robust, and efficient algorithm for the fast charging of lithium-ion battery cells. Hence, this MPC algorithm [3] could potentially be suitable for creating embedded hardware and software designs. The simplicity of this algorithm is based on two major design decisions that reduce its computational complexity: the use of the dual-mode MPC technique and of Hildreth’s quadratic programming technique [26].
The dual-mode MPC technique addresses the computational issue of infinite prediction horizons. This technique divides the problem space into near-future and far-future solution segments. This enables the prediction and control horizons to be decreased significantly, while maintaining performance on par with infinite prediction horizons [26]. The application of this technique to the fast charge of batteries with a feedthrough term is detailed in [26]. As discussed in [26], reducing the prediction horizon dramatically reduces the size of the matrices utilized in MPC, which in turn reduces the computational complexity. Trimboli’s group, in [3, 26], evaluated various control and prediction horizons for optimal performance using the near-future/far-future approach and determined the optimal control and prediction horizons to be 1 and 10, respectively.
Hildreth’s quadratic programming (HQP) technique is an iterative process that is deemed suitable for embedded systems designs [27]. This technique belongs to the active-set dual-primal quadratic programming (QP) family and has two main features that are beneficial for embedded designs: (1) no matrix inversion is required, hence it manages poorly conditioned matrices, and (2) the computations operate on scalars instead of matrices, thus reducing the computational complexity [27]. With the HQP, the intention of the MPC is to bring the battery cell to a fully charged position in the least amount of time. In order to reduce the computational effort [3], a pseudo min-time problem is implemented to achieve the same results as the explicit optimal min-time solution. As a result, the HQP technique is deemed appropriate, although it might produce a suboptimal solution if it fails to converge in the allotted iterations [24]. A recent study [24] revealed that the HQP technique performed faster than commercial solvers and required lean code. However, its main drawbacks are that it tends to provide a suboptimal solution more often and that it depends on selecting the optimal number of iterations. In that study [24], the clock speed per iteration of the HQP technique was approximately 15 times faster than the most robust state-of-the-art active-set solver (qpOASES).
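The scalar-only character of Hildreth’s QP can be illustrated with a short C sketch of the standard dual coordinate-ascent iteration (minimize ½λᵀPλ + λᵀK subject to λ ≥ 0). The function name, data layout, and convergence test below are our own illustrative choices, not the implementation from [3] or [27]:

```c
#include <stddef.h>
#include <math.h>

/* Minimal sketch of Hildreth's QP (dual coordinate ascent). P is the n x n
 * dual Hessian (row-major), K the dual linear term, lambda the multipliers.
 * Returns 1 on convergence, 0 if the iteration budget is exhausted, in which
 * case the (possibly suboptimal) multipliers are still usable. */
static int hildreth_qp(const double *P, const double *K, double *lambda,
                       size_t n, size_t max_iter, double tol)
{
    for (size_t it = 0; it < max_iter; ++it) {
        double delta = 0.0;
        for (size_t i = 0; i < n; ++i) {
            /* w_i = -(K_i + sum_{j != i} P_ij * lambda_j) / P_ii : scalar ops only */
            double s = K[i];
            for (size_t j = 0; j < n; ++j)
                if (j != i) s += P[i * n + j] * lambda[j];
            double w = -s / P[i * n + i];
            double li = w > 0.0 ? w : 0.0;   /* project onto lambda_i >= 0 */
            delta += fabs(li - lambda[i]);
            lambda[i] = li;                  /* Gauss-Seidel style in-place update */
        }
        if (delta < tol) return 1;           /* converged */
    }
    return 0;                                /* accept suboptimal multipliers */
}
```

Note that the division is by the scalar diagonal entry P_ii, so no matrix inversion ever occurs, which is exactly the property that makes the method attractive for embedded targets.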
The MPC algorithms can be customized to a specific application or a specific task, based on the requirements of a given application/task. The customized MPC typically reduces the execution overhead required for certain decisionmaking logic that would otherwise be essential for the generalized MPC. Furthermore, embedded architectures are usually designed for a specific application or a specific computation. The above facts demonstrate that the customized MPC algorithms specific to a given model and given constraints are appropriate for embedded hardware/software architectures.
2.1 Dynamic model
As illustrated in Fig. 1, the series resistor R_{0} is the instantaneous-response ohmic resistance when a load is connected to the circuit. In the ECM model, R_{0} represents the feedthrough term in the MPC general state-space Eq. (3) [3, 4, 26]; the R_{1}C_{1} ladder models the diffusion process; and the state-of-charge (SOC) dependent voltage source, i.e., OCV_{z(t)}, represents the open-circuit voltage (OCV). In this case, the relationship between SOC and OCV is nonlinear; thus, it can be implemented as a lookup table (LUT).
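Since the SOC-OCV relationship is nonlinear, one plausible realization is a lookup table with linear interpolation between tabulated points. A minimal C sketch follows; the table values are illustrative placeholders, not measured cell data:

```c
/* Hedged sketch: tabulate the nonlinear SOC-to-OCV curve offline and
 * linearly interpolate at run time. Values below are placeholders. */
#define OCV_POINTS 5
static const float soc_tab[OCV_POINTS] = {0.0f, 0.25f, 0.5f, 0.75f, 1.0f};
static const float ocv_tab[OCV_POINTS] = {3.0f, 3.6f, 3.7f, 3.9f, 4.2f};

static float ocv_lookup(float soc)
{
    if (soc <= soc_tab[0]) return ocv_tab[0];                 /* clamp low  */
    if (soc >= soc_tab[OCV_POINTS - 1]) return ocv_tab[OCV_POINTS - 1]; /* clamp high */
    int i = 0;
    while (soc > soc_tab[i + 1]) ++i;                         /* find segment */
    float t = (soc - soc_tab[i]) / (soc_tab[i + 1] - soc_tab[i]);
    return ocv_tab[i] + t * (ocv_tab[i + 1] - ocv_tab[i]);    /* lerp */
}
```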
The ECM model has a single control input (i.e., the current) and two measured (or computed) outputs (i.e., the terminal voltage v(t) and the SOC z(t)). The main goal is to bring the battery cell to full SOC in the least amount of time. As a result, z(t) becomes the output to be controlled, which makes this MPC a single-input single-output (SISO) system. The current i(t), which is the control input signal, is represented in the state-space equations as u(k). By employing the MPC algorithm, our intention is to find the best control input, i(t), that produces the fastest charge while respecting the physical constraints of the cell. Typically, the parameters (elements) of the ECM model are temperature dependent.
The creation of our unique and efficient embedded architectures for the MPC algorithm is inspired by and based on the MPC algorithms presented in [3, 4, 26–28], with many modifications to cater to the embedded platforms. The feedthrough term and dualmode adjustments are inspired by and based on the ones in [3, 4, 26].
where D_{m} is the feedthrough term, which is a necessary term for the ECM model of this battery.
and also \( {x}_k=\left[\frac{\Delta {x}_{m,k}}{y_k}\right] \) from adding the integral action.
2.2 Prediction of state and output variables
Trimboli’s group [4, 26] incorporated a feedthrough term in the modified MPC algorithm, which was built upon and extended from the work done in [29]. A detailed description of the extended work can be found in [4, 26], and the synopsis of this approach can be found in [3]. For illustration purposes, the summary of this approach is presented below.
where \( {\mathbf{v}}_{1\times {N}_c}=\left[1\kern0.5em 1\kern0.5em 1\kern0.5em \cdots \kern0.5em 1\right] \).
The aforementioned steps are required to process and complete the MPC algorithm. For our embedded architectures, the above equations (from (10) to (16)) remain the same, since the temperature is considered constant. There are four temperature-dependent variables, Q, R_{0}, R_{1}, and r, utilized in the augmented model. These variables are detailed in Section 4.2.1.
2.3 Optimization
2.4 Hildreth’s quadratic programming technique
2.5 Applying control signal
In this case, the state of charge (SOC) (i.e., x_{k+1}[0] = z_{k+1}) is compared to the reference value to determine whether the Li-ion battery is fully charged. If the SOC is less than the reference (z_{k+1} < reference), the MPC algorithm is repeated to compute the next control signal.
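The repeat-until-fully-charged structure above can be sketched as a simple C loop. Here the controller and plant are deliberately toy stand-ins (a saturating charge integrator with a maximum-current constraint), not the cell model or MPC stages from the paper:

```c
/* Toy sketch of the outer control loop: iterate until the SOC state
 * reaches the full-charge reference. clamp() enforces the current
 * constraint; the plant update is simple Coulomb counting. */
static float clamp(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Returns the number of sampling intervals needed to reach soc_ref. */
static int charge_loop(float soc, float soc_ref, float i_max, float dt_over_q)
{
    int steps = 0;
    while (soc < soc_ref && steps < 100000) {
        /* "optimizer": request the current that closes the gap in one step,
         * clamped to the maximum-current constraint */
        float u = clamp((soc_ref - soc) / dt_over_q, 0.0f, i_max);
        soc += dt_over_q * u;   /* plant update */
        ++steps;
    }
    return steps;
}
```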
3 Design approach and development platform
In this research work, we introduce our unique, novel, and efficient embedded architectures (two hardware versions and one software version) for the fast-charge model predictive controller (MPC). Our proposed embedded architectures for the fast-charge MPC algorithm are inspired by and based on the modified MPC algorithm for the lithium-ion battery cell-level MPC modeled by Trimboli’s group [3, 4, 26]. We obtained the source codes written in Matlab for the existing fast-charge MPC algorithm from Trimboli’s research group [4]. We use this validated Matlab model as the baseline for the performance and functionality comparison presented in Section 5.
For all our experiments, both software and hardware versions of various computations are implemented using a hierarchical platform-based design approach to facilitate component reuse at different levels of abstraction. Our designs consist of different abstraction levels, where higher-level functions utilize lower-level subfunctions and operators. The fundamental operators such as add, subtract, multiply, divide, compare, and square root are at the lowest level; the vector and matrix operations, including matrix multiplication/addition/subtraction, are at the next level; the four stages of the MPC, i.e., model generation, optimal solution, Hildreth’s QP process, and state and plant generation, are at the third level of the design hierarchy; and the MPC is at the highest level.
All our hardware and software experiments are carried out on the ML605 FPGA development board [30], which utilizes a Xilinx Virtex-6 XC6VLX240T-FF1156 device. The development platform includes large on-chip logic resources (37,680 slices), MicroBlaze soft processors, and 2 MB of on-chip BRAM (Block Random Access Memory) to store data/results.
All the hardware modules are designed in mixed VHDL and Verilog. They are executed on the FPGA (running at 100 MHz) to verify their correctness and performance. Xilinx ISE 14.7 and XPS 14.7 are used for the hardware designs. ModelSim SE and Xilinx ISim 14.7 are used to verify the results and functionalities of the designs. Software modules are written in C and executed on the 32-bit RISC MicroBlaze soft processor (running at 100 MHz) on the same FPGA. The soft processor is built using the FPGA general-purpose logic. Unlike hard processors such as the PowerPC, the soft processor must be synthesized and fit into the available gate arrays. Xilinx XPS 14.7 and SDK 14.7 are used to design and verify the software modules. The hardware modules for the fundamental operators are designed using single-precision floating-point units [31] from the Xilinx IP core library. The MicroBlaze is also configured to use single-precision floating-point units for the software modules. Conversely, the baseline Matlab model was designed using double-precision floating-point operators. This has caused some minor discrepancies in certain functionalities of the fast-charge MPC algorithm. These discrepancies are detailed in Section 5.
3.1 Systemlevel design
We also incorporate a MicroBlaze soft processor in both hardware versions. For the embedded hardware, MicroBlaze is configured to have 128 KB of local on-chip memory. As illustrated in Fig. 2, our user-designed hardware module communicates with the MicroBlaze processor and with the other peripherals via the AXI bus [32], through the AXI Intellectual Property Interface (IPIF) module, using a set of ports called the Intellectual Property Interconnect (IPIC). For the hardware designs, the MicroBlaze processor is only employed to initiate the control cycle, to apply the control signals to the plant, and to determine the plant output signal. Conversely, the user-designed hardware module performs the whole fast-charge MPC algorithm. The execution times for the hardware, as well as for the software on MicroBlaze, are obtained using the AXI Timer [33] running at 100 MHz.
4 Embedded hardware and software architectures for MPC
In this section, we introduce unique, novel, and efficient embedded architectures (both hardware and software) for the fast-charge model predictive controller (MPC) algorithm. Apart from our main objective, one of our design goals is to create these embedded architectures to monitor and control not only one battery cell but also multiple battery cells individually, in a time-multiplexed fashion, in order to reduce the hardware resources required for the BMS.

The fast-charge MPC algorithm is partitioned into four main stages:

- Stage 1: Compute the augmented model and gain (or data) matrices.

- Stage 2: Check the plant state (i.e., whether the charging is completed or not); compute the global optimal solution that is not subject to constraints; determine whether the constraints are violated.

- Stage 3: Compute the new or adjusted solution using the HQP procedure, if and only if constraints are violated.

- Stage 4: Compute the new plant states and plant outputs. It should be noted that for experimental purposes, the plant output is computed in stage 4; however, in a real-world scenario, the plant output would be a measured value.
In order to enhance the performance and area efficiency of both our embedded hardware and software designs, all the time-invariant computations are relocated to stage 1 from the other stages of the MPC algorithm. In this case, stage 1 is considered the initial phase, which is performed only once at the beginning of the Control Prediction Cycle, whereas the subsequent stages (stages 2, 3, and 4) are performed in every sampling interval in an iterative fashion. Relocating the time-invariant computations to stage 1 dramatically reduces the time taken to perform the subsequent stages and enhances the overall speedup of the MPC algorithm. For example, consider the P parameter typically associated with stage 3. This P is created by multiplying a 32-word vector by a 32-word vector to create a 32 × 32 matrix, which comprises 1024 multiplications. This computation would usually take 1032 clock cycles per iteration, if we employ an FPU multiplier that produces a multiplication result every clock cycle after an initial latency of 8 clock cycles. With the original fast-charge MPC algorithm [3], the P parameter is computed every time stage 3 is executed. By moving the P parameter computation to stage 1, we save 1032 clock cycles per iteration. These execution times and speedups are detailed in Section 5.
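The stage-1 relocation of P can be illustrated in C: the outer product of two 32-word vectors is computed once, up front, rather than on every stage-3 iteration. The size N and the operand names are illustrative:

```c
/* Sketch of the time-invariant P computation moved to stage 1:
 * P = v * w' is a 32 x 32 outer product (N*N = 1024 multiplies),
 * paid once per Control Prediction Cycle instead of per iteration. */
#define N 32

static void precompute_P(const float v[N], const float w[N], float P[N][N])
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            P[i][j] = v[i] * w[j];   /* one-time cost */
}
```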

Two characteristics of the fast-charge MPC algorithm further reduce its computational cost:

- The algorithm contains only one matrix inversion, which is time-invariant and thus needs to be computed only once, provided that the temperature remains constant.

- The dual-mode approach allows for a short prediction horizon (N_{P} = 10) and a short control horizon (N_{C} = 1), which reduces the size of the matrices while maintaining the required stability. It also reduces the single matrix inversion to a scalar inversion, thus eliminating matrix inversion altogether.
Our proposed embedded architectures for the fast-charge MPC are detailed in the following subsections.
4.1 Embedded software architecture
Initially, we design and develop the software for the fast-charge MPC algorithm in C, using the Xcode integrated development environment. This software design is executed on a desktop computer with a dual-core i7 processor. Then, the results are compared and verified against the baseline results from the Matlab code. Both the C and Matlab results are also used to verify the results from our embedded software and hardware designs.
Due to the limited resources of embedded devices, it is imperative to reduce the code size of the embedded software design. Hence, we dramatically modify the above software design (executed on the desktop computer) to fit onto the embedded microprocessor, i.e., the MicroBlaze. In this case, we make the code leaner and simpler, such that it fits into the program memory available with the embedded microprocessor, without affecting the basic structure and functionalities of the algorithm. Many design decisions made for the hardware optimizations are also employed to optimize the embedded software design whenever possible, including reordering certain operations to reduce redundancy (e.g., computing the P parameter in stage 1). We also incorporate techniques to reduce the use of for loops where appropriate and perform loop unrolling when speed is important. Furthermore, we identify parts of the program where offline computations can be done without exceeding the memory requirements.
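A small example of the loop-unrolling technique mentioned above, applied to a dot product; the unroll factor of four and the use of separate accumulators are illustrative choices, not the exact ones used in our design:

```c
/* Sketch: dot product unrolled by four to cut loop overhead and expose
 * independent accumulators to the compiler/FPU pipeline. */
static float dot_unrolled(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {      /* main unrolled body */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; ++i)               /* remainder loop */
        s += a[i] * b[i];
    return s;
}
```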
Table 1 Software algorithm for fast-charge MPC

Stage 1:
  1.1. Get temperature
  1.2. Call parameter function
  1.3. Calculate Φ and G matrices
  1.4. Create G_{nf} and G_{ff} (nf = near-future and ff = far-future dual-mode data)
  1.5. Calculate E
  1.6. Calculate P (matrix for Hildreth QP)
  1.7. Build M (constraints vector)
  1.8. Start loop: compare x_{m}[0] (SOC) to the reference to see if fully charged. If not fully charged, continue; else exit

Stage 2:
  2.1. Calculate F
  2.2. Solve FE^{−1} (optimal unconstrained Δu from J)
  2.3. Build γ (constraints vector)
  2.4. Compare: MΔu ≤ γ

Stage 3:
  3.1. False: call Hildreth QP; develop a new Δu that meets the constraints
  3.2. True: go to stage 4 (4.1)

Stage 4:
  4.1. Calculate the next control signal, next states, and outputs
  4.2. Go to start loop (1.8)
4.2 Embedded hardware designs
In this research work, we design and develop two hardware versions: the register-based hardware version 1 (HW_v1) and the on-chip BRAM-based hardware version 2 (HW_v2). With HW_v1, a customized and parallel processing architecture is introduced to perform the matrix computations in parallel, mostly utilizing registers to store the data/results. By employing a parallel processing architecture, we anticipate an enhancement of the speedup of the overall MPC algorithm. With HW_v2, an optimized architecture is introduced to address certain issues that have arisen with HW_v1. By employing on-chip BRAMs to store the data/results, we expect a reduction in overall area, since the registers and the associated interconnects (in HW_v1) typically occupy large space on chip. Conversely, the existing on-chip BRAMs are dual-port; hence, these could potentially hinder parallel processing of computations.
The register-based HW_v1 is designed to follow the software functional flow of the MPC algorithm presented in Table 1, thus having similar characteristics to the embedded software design. In this case, registers are used to hold the matrices, which is analogous to the indexing of the matrices in C programming. It should be noted that initially, we introduce HW_v1 almost as a proof-of-concept work; next, we introduce HW_v2 to address certain issues that have arisen with HW_v1.
Xilinx offers two types of floating-point IP cores: AXI-based and non-AXI-based. For the register-based HW_v1, we use the standard AXI-based IP cores for the fundamental operators. These IP cores provide standardized communications and buffering capabilities and occupy less area on chip, at the expense of higher latency. For the BRAM-based HW_v2, we utilize the non-AXI-based IP cores for the fundamental operators. These IP cores allow the lowest-latency adder (5-cycle latency) and multiplier (1-cycle latency) units that support a 100 MHz system clock, at the expense of occupying more area on chip. The non-AXI-based cores have less stringent control and communication protocols; thus, proper timing of signals is required to obtain accurate results. With HW_v2, we manage to use lower-latency but more resource-intensive IP cores, since it consists of fewer multipliers and adders, whereas with HW_v1, we have to use higher-latency but less resource-intensive IP cores, since it comprises a large number of multipliers and adders, due to the parallel processing nature of the design.
Initially, we design and develop the embedded hardware architectures for each stage as separate modules, analogous to our hierarchical platform-based design approach. The hardware design for each stage consists of a data path and a control path. The control path manages the control signals of the data path as well as the BRAMs/registers. Next, we design a top-level module to integrate the four stages of the MPC algorithm and to provide the necessary communication/control among the stages. Among various control/communication signals, the top-level module ensures that the plant outputs, the state values, and the input control signals are routed to the correct stages at the proper times. The control path of the top-level module consists of several finite-state machines (FSMs) and multiplexers to control the timing, routing, and internal architectures of the designs. The internal hardware architectures of the four stages of the MPC algorithm are detailed in the following subsections.
4.2.1 Stage 1: augmented model and gain matrices
Stage 1, the initial phase of the MPC algorithm, is performed only once at the beginning of the Control Prediction Cycle. All the time-invariant computations, which are deemed independent of χ_{k} and u_{k}, are relocated to and performed in stage 1, to ease the burden of the compute-intensive iterative portions of the MPC algorithm.
Computing parameters
Temperature regions for cubic spline
Region  Range  Reference (°C) 

1  −15 °C ≤ T < −5 °C  −15
2  −5 °C ≤ T < 5 °C  −5
3  5 °C ≤ T < 15 °C  5
4  15 °C ≤ T < 25 °C  15
5  25 °C ≤ T < 35 °C  25
6  35 °C ≤ T  35
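The region selection shown in the temperature-region table above can be sketched as a small C helper that maps a measured cell temperature to its region’s reference temperature. Only the thresholds come from the table; the function shape is our own illustrative choice:

```c
/* Sketch: map a cell temperature (degrees C) to the reference temperature
 * of its cubic-spline region. Each region spans 10 degrees C starting at
 * its reference; temperatures below -15 C fall into region 1, and
 * temperatures of 35 C and above fall into region 6. */
static int region_reference(float t_celsius)
{
    static const float refs[6] = {-15.0f, -5.0f, 5.0f, 15.0f, 25.0f, 35.0f};
    int r = 0;
    while (r < 5 && t_celsius >= refs[r] + 10.0f)  /* past region upper bound */
        ++r;
    return (int)refs[r];
}
```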
For HW_v1, in stage 1, the parameter module is excluded due to the resource constraints on chip. In this case, for HW_v1, the temperature-dependent parameters are considered constants and stored in registers, on the premise that the temperature will remain constant [4]. In this paper, for the experimental results and analysis (in Section 5), we consider the temperature to be constant for both hardware versions. With the current experimental setup, the additional parameter module does not impact the precision or the performance of the proposed embedded designs.
Creating augmented model
The augmented state-space equation matrices are given in Eq. (9) (in Section 2.1), where Δt is the sampling time (considered as 1 s) and η is the cell efficiency (considered as 0.997). Also, the \( e^{\Delta t/\tau} \) term is currently stored as a constant and provided as an input for both hardware versions. For both HW_v1 and HW_v2, the augmented model computes all the elements in Eq. (42) and then stores the values in the correct order of the matrices, in registers (for HW_v1) and in BRAMs (for HW_v2). In addition, the augmented model for HW_v2 computes P_{1} and P_{2} in Eq. (24).
Computing gain matrices
As demonstrated in Fig. 7, both hardware versions have the same internal architecture for computing the Φ matrix. In this case, HW_v1 waits until the Φ matrix computation is completed and then loads G_{nf} and G_{ff}. Also, HW_v1 employs two gain matrix modules to compute the \( \left[{\Phi}_v,{G}_{nfv},{G}_{ffv}\right] \) and \( \left[{\Phi}_z,{G}_{nfz},{G}_{ffz}\right] \) matrices in parallel.
Time-invariant computations for HW_v1
As mentioned in Section 4.2.1, all the time-invariant computations (E, M, P, and submatrices of F), which are independent of χ_{k} or u_{k} (from stages 2 and 3), are relocated to stage 1, thus significantly reducing the computation burden in the other stages. For HW_v1 and HW_v2, these computations are designed using different techniques: the register-based HW_v1 employs a parallel-processing architecture, whereas the BRAM-based HW_v2 executes in a pipelined fashion.
E module for HW_v1
F_sub module for HW_v1
M module for HW_v1
HW_v1 employs separate modules to perform G_{ff}m and \( \Phi \tilde{A} \). The internal architecture for \( \Phi \tilde{A} \) is demonstrated in Fig. 11, and the architecture of the G_{ff}m computation is similar to the VM submodule. In the M module, the negation operations (in Eqs. (61) and (63)) are performed by flipping the most significant bit (MSB), i.e., the sign bit, of the 32-bit floating-point values, thus reducing the logic utilized for these operations.
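In IEEE-754 single precision, the MSB is the sign bit, so negation is a single bit flip rather than a full floating-point operation. A Python sketch of the trick (in hardware this is one inverter on bit 31):

```python
import struct

def negate_via_msb(x):
    """Negate a 32-bit float by XOR-ing its sign bit (bit 31),
    mirroring the single-inverter negation used in the M module."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]   # float -> raw bits
    bits ^= 0x80000000                                     # flip the MSB
    return struct.unpack('<f', struct.pack('<I', bits))[0]
```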
P module for HW_v1
In this case, P is a square symmetric matrix; hence, the number of columns and rows is equal to the length of M (in our case 32). To compute this matrix, we use an efficient computation assignment algorithm developed by our group [34]. Utilizing this algorithm, the elements of the P matrix are computed using several parallel PEs: n PEs process n elements of the matrix at a time and compute the whole P matrix with no idle time.
Due to the size of the P matrix (32 × 32), registers are not suitable for storing the matrix on chip; our attempt to store the matrix in registers caused our initial design to exceed the chip resources by 25%. Therefore, we integrate a BRAM into the P module to store the P matrix in HW_v1. In this case, we use only two PEs to compute the elements of the P matrix, due to the port limitations of the BRAMs. Each PE consists of a multiplier and logic elements that ensure the inputs to the multiplier are ready every clock cycle, to reduce the latency. The results of the P matrix computation are reused in stage 3.
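Because P is symmetric, only its upper triangle needs to be computed, and each computed element is mirrored into the lower triangle. The sketch below illustrates that idea with a simple round-robin assignment of upper-triangle elements to PEs; it is our own illustration, not the actual assignment algorithm of [34], and it assumes the scalar E^{−1} mentioned for the E module, so that P = E^{−1}·M·M^{T}:

```python
def assign_upper_triangle(size, n_pes):
    """Round-robin assignment of the upper-triangle elements of a
    symmetric size-by-size matrix to n_pes processing elements."""
    schedule = [[] for _ in range(n_pes)]
    k = 0
    for i in range(size):
        for j in range(i, size):            # symmetry: compute (i, j) once
            schedule[k % n_pes].append((i, j))
            k += 1
    return schedule

def compute_p(m_rows, e_inv, n_pes=2):
    """P = E^-1 * M * M^T, built from the per-PE schedules above."""
    size = len(m_rows)
    P = [[0.0] * size for _ in range(size)]
    for pe in assign_upper_triangle(size, n_pes):
        for i, j in pe:
            val = e_inv * sum(a * b for a, b in zip(m_rows[i], m_rows[j]))
            P[i][j] = P[j][i] = val          # mirror into the lower triangle
    return P
```

The round-robin mapping keeps the two PEs' workloads within one element of each other, which is one way of approaching the "no idle time" property the text describes.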
In summary, HW_v1 is designed with separate modules, including GFFm, ΦA, E, F_sub, M, and P, to execute the various computations in stage 1. In this case, two GFFm modules compute Eqs. (59) and (60) in parallel, and two ΦA modules compute Eqs. (64) and (65) in parallel. The F_sub module computes Eqs. (55)–(58), the M module computes Eqs. (61)–(63), and finally, the P module computes Eqs. (66) and (67).
Time-invariant computations for HW_v2
For the internal architecture of HW_v2, we use a novel and unique approach to perform the E, F, M, and P computations. In this case, we design a unique pipelined multiply-and-accumulate (MACx) module to perform the various vector and matrix multiplication operations in sequence. The MACx has a wrapper, which handles reading/writing from/to the BRAMs during the vector/matrix operations.
For HW_v2, the matrix addition and the scalar operations are typically performed in the E, M, and P modules. In this case, the E module organizes the scalar addition, multiplication, and division necessary to generate E^{−1}. The M module performs the scalar addition and multiplication to generate M (for Eqs. (61)–(63)) and F_{1a} (for Eq. (55)), when using BRAMs to store the vectors. Equations (61) and (55) generate the same values.
Furthermore, in HW_v1, computations such as G_{ffz}m are not available for subsequent operations until the whole computation has been completed (i.e., all the elements are computed). Conversely, in HW_v2, as soon as one element is computed in one operation, that element can be used in subsequent operations. For instance, for HW_v2, when MACx completes the first vector computation (i.e., G_{ffz_row0} * m), the resulting element and the first element of G_{nfz} in Eq. (63) are utilized by the M module to generate the first element of Mposz. This dramatically reduces the time required to execute stage 1, as detailed in Section 5.
 1.
E_{5a} = m^{T}m, Eq. (49). In this case, a single ROM port is utilized to preload the m vector into both input buffers of the MACx. This occurs in parallel with the Φ and the gain matrix calculations. After the multiply and add operations of the MACx are completed, the output MACx module sends a signal to the E module, indicating that this value is ready. The E module accesses the value from the MACx output register and multiplies this value with P_{2} to create E_{5}. The MACx output register is also the input register used to store the data in RAMhigh. This value is stored in the memory while the E module accesses the value to send it to an adder.
 2.
\( {E}_{3a}={\mathbf{m}}^T{G}_{ffz}^T={G}_{ffz}\mathbf{m}={Mposz}_a \), Eq. (47). From step 1, the m vector is already loaded into one input buffer of the MACx, and a single RAMlow port is required to load a row of G_{ffz} into the other input buffer of the MACx. The multiplier sends a signal to the input module to preload the next row of G_{ffz} into the MACx input register. The m vector remains in the input buffer until cleared or overwritten. This step continues until all the rows of G_{ffz} have been entered. Once the required vector is available, the output MACx module sends a signal to the M and input modules and then loads the vector into RAMhigh. The M module uses this vector (E_{3a}) to create F_{1a}. E_{3a} is also used in step 5 to create E_{3}. Next, steps 3 and 4 are selected for execution, since the inputs to these steps are already available. Furthermore, these two steps can be executed in the pipeline with no stall states.
 3.
\( {E}_1={G}_{nfz}^T{G}_{nfz} \), Eq. (45). Since G_{nfz} is a vector, a single RAM port is required to load G_{nfz} into both MACx input buffers. After completing this computation, the output MACx module sends a signal to the E module, indicating that this value is ready. The E module adds this value (E_{1}) to E_{5} and stores it in a temporary register.
 4.
\( {E}_{2a}={G}_{nfz}^T{G}_{ffz} \), Eq. (46). From step 3, G_{nfz} is already loaded into the MACx input buffer, and a single RAM port is used to prefetch the columns of G_{ffz}. Once the multiplier indicates that it has started executing, the input module preloads the next column of G_{ffz} into the input buffer to compute the next term of E_{2a}. This step continues for all the columns of G_{ffz}. E_{2a} is used in step 6 to create E_{2}; consequently, E_{2a} is stored in RAMlow, and a signal is forwarded to the input module once it is completed.
 5.
E_{3} = E_{4a} = E_{3a}G_{ffz}, Eq. (48). The time it takes to load steps 3 and 4 ensures that the operation started in step 2 (E_{3a}) is completed. The input module ensures that this value is ready by checking the complete signal. One port from each RAM is used to preload E_{3a}, while a column of G_{ffz} is loaded into the MACx input buffers. This step continues until all the columns of G_{ffz} have been loaded. Upon completion of the MACx operations, the output module sends a signal to the E module indicating that the value (E_{3}) is available. The E module accesses the MACx output register to add this value to E_{5} + E_{1}.
 6.
E_{2} = E_{2a}m, Eq. (46). The m vector is loaded into one MACx input buffer using a ROM port. Simultaneously, E_{2a} is completed, and step 5 is being executed. Then, E_{2a} is loaded into the other input buffer using a RAMlow port. Once the MACx operation is completed, the output module sends a signal to the E module and the E module accesses the MACx output register to add this value to E_{5} + E_{1} + E_{3}.
 7.
Mposv_{a} = G_{ffv}m, Eq. (59). As mentioned before, the m vector is already present in the input buffer of the MACx. Hence, a RAM port is required to load the rows of G_{ffv} into the other MACx input buffer. This step continues until all the rows of G_{ffv} have been operated on. Once the MACx operations are completed, the output module sends a signal to the M module. The M module uses this value to build the M constraint vector.
 8.
E_{4} = E_{4a}m = E_{3}m, Eq. (48). For step 8, the m vector is still present in the input buffer, and E_{3} is completed, while step 7 is being executed. A single RAMlow port is required to load the E_{3} into the MACx input buffer. Upon completion of the MACx operations, the output module sends a signal to the E module indicating that E_{4} is completed. The E module accesses the RAM input data register to add E_{4} to E_{5} + E_{1} + E_{3} + E_{2} + P_{1} to create the final E value.
 9.
F_{2a} = F_{1a}Φ_{z}, Eq. (57). F_{1a} is calculated in the M module using the output from step 7 and loaded into a FIFO buffer to eliminate any memory access for step 9. F_{1a} is loaded into the input buffer from the FIFO. Simultaneously, the first column from Φ is loaded into the other input buffer from RAM. This step continues until all three columns of Φ have been loaded into the MACx. Once the MACx computations for F_{2a} are completed, the MACx output module sends a signal to the MACx input module, to initiate the execution of step 10.
 10.
\( {F}_{2c}={F}_{2a}\tilde{A} \), Eq. (58). Once the input module receives a signal that step 9 is completed, the F_{2a} vector is loaded into one input buffer and the first column of Ã is loaded into the other input buffer of MACx. This step continues until the three columns of Ã have been loaded into the MACx. Once the computations for F_{2c} are completed, this value is stored in the memory and a Done signal is set to indicate the completion of this step.
4.2.2 Stage 2: unconstrained solution
 1.
Determine whether the battery has reached a full charge, i.e., x_{m0} ≥ 0.9, which indicates that the state of charge (SOC) is greater than or equal to 90%. This limit of x_{m0} ≥ 0.9 is designed to prevent overcharging of the battery [3].
 2.
Compute the current open circuit voltage (OCV) value based on the input SOC or x_{m0}.
 3.
Compute the unconstrained general optimal solution for the control input, Δu^{ο} = − E^{−1}F, from Eq. (30).
 4.
Compute the γ constraint vector from Eq. (31).
 5.
Compute MΔu^{ο} from Eq. (31).
 6.
Compute K from Eq. (35).
 7.
Perform an element-by-element comparison, MΔu^{ο} ≤ γ, from Eq. (31).
From the above steps, the vector K is computed in stage 2, although it is utilized in stage 3, since K needs to be computed only once per time sample. The time sample for controlling the charging of a battery is 1 s, i.e., the control signal is updated every second for charging or discharging a battery cell. In this case, steps 2 and 3 are performed in parallel; next, steps 4 and 5 are performed in parallel; and finally, steps 6 and 7 are performed in sequence.
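In software terms, steps 3 and 5–7 above reduce to a few lines. The sketch below is ours (names included); it treats E as a scalar, matching the scalar E^{−1} produced by the E module, and takes γ as a precomputed input:

```python
def stage2_unconstrained(E_inv, F, M, gamma):
    """Stage-2 core: unconstrained optimum, constraint products, and the
    K vector. Returns (du0, K, ok), where ok=True means no constraint is
    violated and stage 3 can be bypassed."""
    du0 = [-E_inv * f for f in F]                   # du° = -E^-1 F, Eq. (30)
    m_du0 = [sum(mi * d for mi, d in zip(row, du0)) for row in M]   # M du°
    K = [g - md for g, md in zip(gamma, m_du0)]     # K = gamma - M du°, Eq. (35)
    return du0, K, all(k >= 0 for k in K)           # M du° <= gamma  <=>  K >= 0
```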
Computing OCV for HW_v1 and HW_v2
OCV computation from SOC
Open circuit voltage from state of charge algorithm

1. Determine the boundary conditions: if (x_{m0} < 0), use the minimum precalculated OCV; else if (x_{m0} > 1), use the maximum precalculated OCV; else if (0 < x_{m0} < 1), compute the OCV using steps 2 to 4.
2. Find the index: I = int(200 · x_{m0})
3. Find the difference (D) and offset (S): D = 200 · x_{m0} − I; S = 1 − D
4. Compute the OCV using temperature (T): OCV = (OCV_{0}[I] ∗ S + OCV_{0}[I + 1] ∗ D) + T ∗ (OCV_{rel}[I] ∗ S + OCV_{rel}[I + 1] ∗ D)
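The table-lookup-with-interpolation above can be sketched directly (a Python illustration with our own names; the `min(i, 199)` guard, which keeps `i + 1` inside a 201-entry table, is our addition):

```python
def ocv_from_soc(x, T, ocv0, ocvrel):
    """Interpolate the open-circuit voltage from SOC x over a 201-point
    table (index = int(200*x)), with a linear temperature correction."""
    if x < 0:
        return ocv0[0] + T * ocvrel[0]       # minimum precalculated OCV
    if x > 1:
        return ocv0[200] + T * ocvrel[200]   # maximum precalculated OCV
    i = min(int(200 * x), 199)               # keep i+1 inside the table
    d = 200 * x - i                          # fractional distance to entry i+1
    s = 1.0 - d
    return (ocv0[i] * s + ocv0[i + 1] * d) + T * (ocvrel[i] * s + ocvrel[i + 1] * d)
```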
Computing unconstrained general optimal solution for HW_v1
Computing γ constraint vector for HW_v1
Computing MΔu° for HW_v1
Computing K vector for HW_v1
In HW_v1, the K vector is computed before the final step 7 (in stage 2), which performs the comparison operation. The K vector is one of the first operands of stage 3, and its computation requires a minimum of 32 subtractions. Hence, in order to ensure that K is ready for stage 3, the K vector is computed before performing the comparison presented in Eq. (31), MΔu^{ο} ≤ γ. As illustrated in Fig. 19b, the K module is a simple VV module, which consists of a subtractor to subtract each element of the input vectors.
Computing comparison for HW_v1
In the final step of stage 2, for HW_v1, the two vectors MΔu° and γ are compared element by element using an FPU comparator. The internal architecture of the comparison module is illustrated in Fig. 19c. In this case, if the constraints are not violated, the comparison module performs all 32 compare operations and then goes to stage 4. However, if the constraints are violated, the comparison module triggers stage 3 and abandons the execution of the remaining compare operations.
Computing unconstrained solutions for HW_v2
In stage 2, similar to stage 1, for the internal architecture for HW_v2, we use the pipelined MACx module for the matrix and vector multiplication operations. The utilization of the MACx module (for HW_v2) drastically reduces the occupied area on chip for stage 2 compared to that of HW_v1. For instance, for the OCV module, HW_v1 uses 20 dedicated IP cores, whereas HW_v2 uses only 8 dedicated IP cores. The space analysis is detailed in Section 5.
 1.
F_{2c}χ for F in Eq. (68)
 2.
\( \Phi {}_z\tilde{A}\chi \) for γ in Eq. (26)
 3.
\( {\Phi}_v\tilde{A}\chi \) for γ in Eq. (26)
Since the maximum length of the individual vectors is 3, the 5-stage pipelined MACx module uses only the first three pipeline stages, reducing the overall execution time.
The input AU module sends the necessary operands to the AU module, which performs the remaining operations (not performed by MACx) in stage 2. The output AU module forwards the results to be stored in the BRAM. With the AU module, multiplication results are generated every clock cycle after an initial latency of 1 clock cycle, and addition/subtraction results are also generated in every clock cycle after an initial latency of 5 clock cycles.
A handshaking protocol is used to communicate between the input AU and output AU modules. After completing any intermediate computations, the output AU module sends a signal to the input AU module, indicating that the intermediate data (results from previous arithmetic operations) are ready for subsequent arithmetic operations. Utilizing two modules (i.e., input AU and output AU) to read from and write to the memory separately significantly reduces the complexity of the control path for both modules. This also minimizes the setup- and hold-time violations, thus improving the overall efficiency of stage 2.
In the HW_v2 design, the comparison (final step 7) is performed while computing K, instead of using a separate comparator module as in HW_v1. Considering Eq. (35), K = γ − MΔu^{ο}, and the comparison in Eq. (31), MΔu^{ο} ≤ γ, the comparison is true if K ≥ 0. Hence, by examining the MSB (sign bit) of K, we can determine whether the constraints are met. If all the elements meet the constraints, then the optimal solution is selected and stage 4 is executed, bypassing stage 3. In HW_v2, if one or more elements violate the constraints, then we start executing stage 3 immediately after performing the K computation in stage 2. This eliminates the separate compare operations that HW_v1 performs in a dedicated module. As illustrated in Fig. 20, HW_v2 has an integrated solution for stage 2, whereas HW_v1 has a modular solution (depicted in Figs. 17, 18, and 19).
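Reading the sign bit in place of a full comparison can be sketched as follows (our illustration; note that in IEEE-754 the value −0.0 also has its sign bit set, so a robust software check would simply test k < 0, whereas the hardware here compares against the MSB directly):

```python
import struct

def violates_constraint(k):
    """Constraint check via the sign bit: M du° <= gamma holds iff
    K = gamma - M du° >= 0, so a set MSB (bit 31) on the 32-bit
    float K flags a violated constraint."""
    bits = struct.unpack('<I', struct.pack('<f', k))[0]
    return (bits >> 31) == 1
```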
4.2.3 Stage 3: Hildreth’s quadratic programming
In stage 3, we compute the constrained optimal control input using Hildreth’s quadratic programming (HQP) approach. With this approach, the Δu°, which is known as the global optimal solution, is adjusted by λME^{−1} (as in Eq. (38)), where λ is a vector of Lagrange multipliers.
Initially, for stage 3, we use the primal-dual method of the active-set approach, which reduces the total constraints down to the active constraints (i.e., nonzero λ elements), thus reducing the computational complexity (3 or fewer computations versus 32). Apart from reducing the computational complexity of stage 3, this approach also reduces that of stage 4, since the stage 4 design needs to compute only 1 to 3 active elements of the lambda (λ) vector instead of all 32 elements.
Next, we use the HQP technique, which further simplifies the above computations by finding the vector of Lagrange multipliers (λ), for the HQP solution one element at a time. This HQP technique eliminates the need for matrix inversion in optimization. In this case, the λ vector has either positive nonzero values for active constraints or zero values for inactive constraints.
Typically, not all the constraints are active at the same time, making λ a sparse vector. Since only the active constraints need to be considered, both hardware versions are designed to operate on the sparse vector, reducing the total computations involved in the operation.
It should be noted that the HQP technique does not always converge. Therefore, a suitable iteration length (number of iterations) is selected, in order to provide the greatest possibility of convergence, as well as to provide a reasonable solution in case there is no convergence.
 1.
Compute individual elements of λ vector from Eqs. (36) and (37).
 2.
Determine whether the λ vector meets the convergence criteria.
 3.
If it does, compute the new Δu using the updated λ vector, else go to step 1.
For both hardware versions (HW_v1 and HW_v2), we decompose stage 3 into the above three main modules, illustrated in steps 1 to 3. Firstly, the λ module (Wp3) computes the first λ vector. Secondly, the convergence module (Converge_v1) determines whether the current λ vector converges or not; simultaneously, the λ module computes the next λ vector. If the current λ vector converges, then the λ module stops the execution of the next λ vector. In this case, the λ module performs the computations of Eqs. (36) and (37) (from Section 2) on each element.
HQP algorithm
Hildreth’s quadratic programming technique (HQP algorithm)

For iterations 1 to 40:
1. Save λ_{current} → λ_{previous}
2. Start outer loop to build λ, i = 0 to # elements in M or M_{size}
   a. w = 0
   b. Start inner loop to build λ, j starts at 0
      i. w = w + P[i][j] ∙ λ[j]
      ii. GOTO start inner loop if j < M_{size}
   c. w = w + K[i] − P[i][i] ∙ λ[i]
   d. λ_{test} = −w / P[i][i]
   e. If λ_{test} < 0, then λ[i] = 0; else λ[i] = λ_{test}
   f. GOTO start outer loop if i < M_{size}
3. Check convergence
   a. Calculate the Euclidean length of the previous λ
   b. Calculate the Euclidean length of the current λ
   c. Compare the ratio to a reference value
   d. If converged, exit the iteration; GOTO calculate new Δu
4. Else execute the next iteration, GOTO 1
5. Calculate new Δu
   a. Start loop, j = 0 to M_{size}
      i. Δu_{c} = Δu_{c} + λ[j] ∙ ME^{−1}[j]
   b. GOTO start loop if j < M_{size}
   c. Δu_{k+1} = Δu° − Δu_{c}
6. End
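Hildreth's element-wise update λ_i = max(0, −w_i/P_ii) maps directly onto a few lines of Python. The sketch below is ours: it takes the stage-1/stage-2 products P, K, ME^{−1}, and Δu° as inputs, and its convergence test uses the norm of the change in λ rather than the paper's ratio of Euclidean lengths:

```python
def hildreth_qp(P, K, me_inv, du0, max_iter=40, tol=1e-9):
    """Hildreth's QP: build the Lagrange multipliers lambda one element
    at a time (no matrix inversion), then correct the unconstrained du°."""
    n = len(K)
    lam = [0.0] * n
    for _ in range(max_iter):
        lam_prev = lam[:]
        for i in range(n):
            # w = sum_j P[i][j]*lam[j] + K[i] - P[i][i]*lam[i]   (steps 2.a-2.c)
            w = sum(P[i][j] * lam[j] for j in range(n)) + K[i] - P[i][i] * lam[i]
            lam[i] = max(0.0, -w / P[i][i])                    # steps 2.d-2.e
        # simplified convergence check (step 3): has lambda stopped moving?
        if sum((a - b) ** 2 for a, b in zip(lam, lam_prev)) ** 0.5 < tol:
            break
    # step 5: du_{k+1} = du° - sum_j lam[j] * ME^-1[j]; only nonzero lam matter
    du_c = [sum(lam[j] * me_inv[j][k] for j in range(n)) for k in range(len(du0))]
    return [d0 - dc for d0, dc in zip(du0, du_c)]
```

For a toy problem min ½‖u‖² − u₁ − u₂ subject to u₁ + u₂ ≤ 1 (so Δu° = (1, 1), P = [[2]], K = [−1], and ME^{−1} = [[1, 1]]), the routine returns the constrained optimum (0.5, 0.5).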
For HW_v1
HW_v1 consists of three main modules, including Wp3, Converge_v1, and New_Δu_v1, and a submodule (SVM_v1) for sparse vector multiplication.
From our experimental results (presented in Section 5), it is observed that the λ vector typically has at most three nonzero elements. Hence, our hardware is designed to operate only on the nonzero elements of λ and P. In order to generate all the elements of the λ vector, computations 2.a to 2.f (as in Table 4) must be repeated 32 times. By focusing only on the nonzero elements, our hardware design dramatically reduces the time taken to generate the required λ elements, since certain steps in Table 4 are bypassed.
In this case, the λ vector is updated after 32 iterations, and the updated λ vector is forwarded to the convergence module (Converge_v1). Next, the Converge_v1 module computes step 3 of the HQP algorithm (in Table 4); simultaneously, the New_Δu_v1 module computes step 5 of the HQP algorithm (to generate Δu_{k + 1}) in anticipation of a convergence. At the same time, the λ module (Wp3) starts computing the next λ vector, in the event the current λ does not converge. If the convergence fails, the Δu_{k + 1} value is discarded. If the convergence succeeds, a signal is sent to the Wp3 module to terminate the next λ vector computation, and then the subsequent stage (stage 4) is started with input Δu_{k + 1}.
For HW_v2
For HW_v2, similar to HW_v1, we introduce another sparse vector multiplication (SVM_v2) module, in order to utilize only the active set (nonzero values of the λ vector), thus enhancing the efficiency of the design. This is because the pipelined MACx is not efficient for single-vector multiplication operations. In the Win module, addressing logic is incorporated to track the nonzero elements of the λ vector. These nonzero λ elements and the corresponding indexes are stored as vectors in the BRAMs. The indexes are used to find the corresponding P and ME^{−1} values, thus reducing the number of operations without compromising the accuracy of these values. In this case, the number of operations is reduced from 32 to 3 or fewer.
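In software terms, the active-set bookkeeping amounts to storing the nonzero λ values next to their indexes and walking only that list (a sketch with our own function names):

```python
def compress(lam):
    """Split lambda into its nonzero values and their indexes,
    as the Win module stores them in the BRAMs."""
    pairs = [(v, i) for i, v in enumerate(lam) if v != 0.0]
    return [v for v, _ in pairs], [i for _, i in pairs]

def sparse_dot(vals, idx, dense):
    """Sparse-by-dense dot product: only the active elements are
    multiplied, so a 32-element product shrinks to 3 or fewer operations."""
    return sum(v * dense[i] for v, i in zip(vals, idx))
```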
First, the Win module sends the vector elements to the multiplier, and signals the SVM_v2 module that the sparse vector operation is initiated. Next, if the results are valid, then the valid signal is asserted to start incrementing the counter, and to start loading the results to the FIFO buffer. In this case, the counter is incremented if the multiplier valid signal is asserted (high), the counter is decremented if the adder valid signal is asserted; and the counter is on hold if both the valid signals are asserted or deasserted (low) simultaneously. The FIFO buffer is used to bridge the latency between the multiplier and the adder. If the count is 1, the SVM_v2 module forwards the multiplication results to the output, bypassing the adder.
The internal architecture of the convergence module (Converge_v2) is shown in Fig. 25b. To determine the convergence of the λ vector, the Euclidean distance is computed. In HW_v2, the Euclidean distance is accumulated as each element of λ is computed, one element at a time; conversely, in HW_v1, this distance is measured after all 32 elements are computed. In this case, the λnorm module (in Fig. 25b) takes the scalar λ_{i} as input, squares λ_{i}, and then adds the squared value to the previous accumulation. After the final λ element (λ_{31}) is computed, the output of λnorm is forwarded to a square root module, whose result is the Euclidean distance. This result serves as the length of the current λ (substep 3.b of the HQP algorithm in Table 4) in the current iteration; it is also stored to serve as the length of the previous λ (substep 3.a of the HQP algorithm) in the next iteration.
In this case, the Win module sends λ_{i} to the multiplier and signals the λnorm module that the required data is ready. Next, the λnorm module waits until the multiplier valid signal is asserted and then accumulates the outputs using an adder. After the Win module indicates that the iterations for λ are completed, the final accumulator result of the λnorm module is sent to the FPU square root module to initiate the execution of the Converge_v2 module. The Converge_v2 module typically waits for the square root valid signal to be asserted; during this time, the Converge_v2 module inverts the previous λ length value using a divider.
The entire process is repeated up to 40 times: the system either converges or, after the 40th iteration, is treated as converged. Next, we start executing the New_Δu_v2 module. In this case, the Win module loads the λ and ME^{−1} values into the multiplier for the SVM_v2 module to process and sends a signal to the New_Δu_v2 module to initiate the execution. Depending on the length of the active set (nonzero elements) of the λ vector, the New_Δu_v2 module selects either the output of the multiplier or the output of SVM_v2 as the input to its subtractor. The result of the subtraction is the Δu_{k + 1} value, which is forwarded to stage 4 for processing.
Finally, in stage 3, a clear operation is performed to clear the FIFO, which occurs at the end of vector multiplication by SVM_v2 module. This ensures that invalid data is not incorporated in any computations. The clear operation takes 4 clock cycles and asserts a ready signal to indicate that the result of the SVM_v2 module is ready to be used and also the SVM_v2 module is ready for the next computation.
4.2.4 Stage 4: state and plant
In stage 4, we compute and update the plant state and the plant outputs, using the new Δu (computed in stages 2 or 3) and also utilizing χ, which contains the current states and the current control signal u. In a realworld scenario, the plant outputs are measured and the control signals are sent to the plant input or actuators.
The updated plant states and the input control signals are forwarded to stage 2 for the next iteration. Prior to starting the next iteration, the top-level module (in Section 4.2) determines whether the plant state value (x_{m0}) indicates a full charge or whether we have reached the maximum number of iterations.
During stage 4, we compute the plant output, i.e., the current terminal voltage (v_{k}) and the state of charge (z_{k}), from Eqs. (7) and (8), respectively. Then, the control signal and the state signals are updated: the first element of ΔU_{k} is used to update the control signal from Eq. (39), and the new control signal is used to determine the states for the next iteration from Eq. (40) (in Section 2.5).
For HW_v1
In this case, as shown in Fig. 26, for HW_v1, the voltage v_{k} and the state of charge z_{k} are computed in the plant module, and the control signal u_{k + 1} and the states x_{k + 1} are computed in the state module. The plant and state modules are executed in parallel.
For HW_v2
(1) C_{v}x_{k} → by SVM_v2 module
(2) D_{v}u_{k} → by multiplier
(3) C_{z}x_{k} → by SVM_v2 module
(4) D_{z}u_{k} → by multiplier
(5) A_{m_row0}x_{k} → by SVM_v2 module
(6) A_{m_row1}x_{k} → by SVM_v2 module
(7) u_{k} + Δu_{k + 1} → by adder
(8) C_{v}x_{k} + D_{v}u_{k} → by adder
(9) C_{z}x_{k} + D_{z}u_{k} → by adder
(10) B_{m0}u_{k + 1} → by multiplier
(11) B_{m1}u_{k + 1} → by multiplier
(12) C_{v}x_{k} + D_{v}u_{k} + OCV(z_{k}) → by adder
(13) A_{m_row0}x_{k} + B_{m0}u_{k + 1} → by adder
(14) A_{m_row1}x_{k} + B_{m1}u_{k + 1} → by adder
With the above arrangement, we manage to overlap the SVM_v2 module computations with the multiplier/adder computations, thus reducing the overall execution time for stage 4. In this case, the multiplier and adder modules produce results every clock cycle, and these results are forwarded to the Output module to be stored in a BRAM. Conversely, the time taken for the SVM_v2 module to produce results varies, often depending on the length of the input vectors, and these results are forwarded to the SVM_store module to be stored in a BRAM. Hence, the final result of operation (2) is available (in BRAM) before the final result of operation (1). This concurrent execution of operations significantly reduces the performance bottleneck in stage 4. For HW_v2 in stage 4, we reuse the SVM_v2 module from stage 3; the adder and multiplier IP cores are also reused in other stages to reduce the overall space occupied on chip.
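Operations (1)–(14) together implement one plant/state update. Executed sequentially, with the two-state model that the A_{m_row0/1} and B_{m0/1} names imply, the step looks as follows (our sketch; the OCV is supplied as a callable, and all names are illustrative):

```python
def stage4_update(x, u, du, Am, Bm, Cv, Dv, Cz, Dz, ocv):
    """One stage-4 step: plant outputs v_k, z_k (Eqs. (7)-(8)), then the
    control update (Eq. (39)) and the state update (Eq. (40))."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    z = dot(Cz, x) + Dz * u                  # state of charge, Eq. (8)
    v = dot(Cv, x) + Dv * u + ocv(z)         # terminal voltage, Eq. (7)
    u_next = u + du                          # operation (7), Eq. (39)
    x_next = [dot(row, x) + b * u_next       # operations (13)-(14), Eq. (40)
              for row, b in zip(Am, Bm)]
    return v, z, u_next, x_next
```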
After stage 4 computations are completed, we start computing stage 2. In stage 2, the updated state of charge (SOC) value is compared with the reference value to determine whether the battery is fully charged. The MPC algorithm iterates through stages 2, 3, and 4, until the battery reaches its fully charged condition.
5 Experimental results and analysis
We perform experiments to evaluate the feasibility and efficiency of our proposed embedded hardware and software architectures for the fast-charge model predictive controller (MPC). We also compare our proposed embedded architectures with the baseline model of the fast-charge MPC written in Matlab [4], in order to evaluate and validate the correctness and functionality of our designs. The evaluation setup for our embedded designs is based on real implementations, whereas the evaluation setup for the baseline Matlab model is based on simulation: our embedded hardware and software results are obtained in real time, while these designs are actually running on the Virtex-6 chip, whereas the baseline Matlab results are obtained through simulation on a desktop computer. Apart from the embedded designs, our software design written in C is also executed on a desktop computer, and the corresponding results are compared with the baseline Matlab results. All our experiments are performed with a sample time of 1 s, a temperature of 25 °C, and 3600 iterations.
5.1 Functional verification—comparison with baseline model
It is imperative to ensure that our embedded hardware and software architectures operate correctly; hence, we compare our proposed embedded architectures with the baseline model written in Matlab [4].
Figure 30a, b depicts the SOC of the battery as a percentage. As illustrated in these graphs, our embedded hardware architectures (HW_v1 and HW_v2) and our embedded software architecture show behavior similar to that of the baseline Matlab model for the SOC.
Although at a glance the SOC graphs (Fig. 30a) seem identical for all four designs, a closer look reveals some discrepancies. As illustrated in Fig. 30b, the SOC increases sharply with the embedded systems designs, whereas the SOC increases gradually with the baseline Matlab design. In either case, the designs reach full charge before the expected time of 1216 s, which is determined from the baseline experiments.
Figure 31a, b depicts the terminal voltage of the battery. As illustrated in these graphs, our embedded hardware architectures (HW_v1 and HW_v2) and our embedded software architecture show behavior similar to that of the baseline Matlab design for the terminal voltage. As demonstrated in Fig. 31b, the output voltage does not exceed 4.2 V; this illustrates that the system’s behavior respects the constraints in order to extend the useful life of the battery.
Similar to the SOC graphs, at a glance, the terminal voltage graphs (Fig. 31a) seem identical for all four designs; a closer look (Fig. 31b) reveals some discrepancies. For instance, at time t = 0 s, the initial cell terminal voltage for the embedded systems designs is 3.92 V, whereas that for the baseline Matlab design is 4.11 V. Further experiments and analysis confirm that this discrepancy does not affect the overall functionality of the system or the final outcome of the MPC algorithm.
As illustrated in Fig. 31b, the cell terminal voltage increases gradually and smoothly with the baseline Matlab design, whereas it increases sharply in the beginning and then decreases gradually with the embedded systems designs. In this case, the difference between the two is merely 1.2 mV.
Figure 32a, b depicts the control signal, i.e., the current, generated by the designs, which drives the terminal voltage and the SOC responses. As illustrated in these graphs, our embedded hardware architectures (HW_v1 and HW_v2) and our embedded software architecture show behavior similar to that of the baseline Matlab design for the control signal (I_{cell}). In this case, a negative value for the current means that the current is flowing into and charging the battery, rather than flowing out of the battery and being used in the system.
The current starts out constant at the maximum allowed value of −15 A; the negative value indicates that the current is charging the battery rather than powering the system. Once the terminal voltage reaches its maximum value, the voltage is held constant and the current gradually decays toward zero. The current shows a steep decay at around 1200 s, which is when the SOC reaches 90% and the battery is considered fully charged.
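The constant-current-then-taper behavior described above can be illustrated with a toy simulation. This is not the paper's MPC: the series resistance, open-circuit-voltage curve, and cell capacity below are hypothetical values chosen only to reproduce the qualitative shape of the responses in Figs. 31 and 32.

```python
# Toy constant-current/constant-voltage charge profile (NOT the paper's MPC).
# All model parameters below are illustrative assumptions.

V_MAX = 4.2          # terminal-voltage constraint (V)
I_MAX = -15.0        # maximum charge current (A); negative = charging
SOC_FULL = 0.90      # SOC threshold treated as "fully charged"
CAP_AS = 15 * 3600   # hypothetical cell capacity (ampere-seconds)
R0 = 0.01            # hypothetical series resistance (ohm)

def ocv(soc):
    """Hypothetical linear open-circuit-voltage curve (illustration only)."""
    return 3.5 + 0.7 * soc

def charge(soc=0.0, dt=1.0):
    """Charge until SOC_FULL; returns elapsed seconds and an (i, v, soc) trace."""
    t, trace = 0.0, []
    while soc < SOC_FULL:
        # Largest charge current that respects both the current limit and
        # the terminal-voltage constraint v = ocv(soc) - R0 * i <= V_MAX.
        i = max(I_MAX, (ocv(soc) - V_MAX) / R0)
        v = ocv(soc) - R0 * i
        soc += -i * dt / CAP_AS          # coulomb counting
        trace.append((i, v, soc))
        t += dt
    return t, trace
```

The trace starts at the −15 A limit, tapers once the 4.2 V constraint binds, and stops at 90% SOC, mirroring the qualitative behavior described above.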
Similar to the terminal voltage graphs, at a glance the control signal graphs (Fig. 32a) seem identical for all four designs; on closer inspection (Fig. 32b), there are some discrepancies. As illustrated in Fig. 32b, the discrepancies are most prominent between 1090 and 1120 s. However, these discrepancies do not affect the overall functionality or the final results of the designs and are thus negligible.
5.1.1 Summary
From these results and analysis, we can conclude that our embedded designs exhibit behavior and functionality similar to those of the baseline Matlab model, confirming the correctness of our designs. There are some slight discrepancies, on the order of millivolts for the voltage and milliamps for the current. These discrepancies arise mainly because we use single-precision floating-point units for our embedded hardware and software architectures, whereas the baseline Matlab model was created using double-precision floating point. In addition, we use different techniques to solve the linear algebra equations than those used in the baseline model, which might further contribute to these discrepancies. Further experiments and analysis reveal that these discrepancies are too small to impact the overall functionality and performance of the fast-charge MPC and are thus negligible.
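The single- versus double-precision effect can be demonstrated in isolation. The sketch below is illustrative only and unrelated to the MPC code: it accumulates a small voltage increment in both precisions, truncating every intermediate result to a 32-bit float, and the two sums diverge by roughly the millivolt scale reported above.

```python
import struct

def f32(x):
    """Round a Python double to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Accumulate a 0.1 mV increment 100,000 times in single vs double precision.
step = 1.0e-4
v_single, v_double = 0.0, 0.0
for _ in range(100_000):
    v_single = f32(v_single + step)   # every intermediate rounded to float32
    v_double = v_double + step        # full double precision
drift = abs(v_single - v_double)      # single-precision rounding error
```

The exact drift depends on the operation count and magnitudes involved, but it shows how two numerically equivalent programs can disagree at the millivolt level purely through rounding.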
5.2 Performance metrics—execution time and resource utilization
We perform experiments to evaluate the feasibility and efficiency of our embedded hardware and software architectures in terms of speed performance and resource utilization on chip.
5.2.1 Execution times and speedup: embedded hardware versus software on MicroBlaze and Intel i7
Table 5 Execution times: embedded hardware and software designs and the baseline Matlab model

Configuration                                | Execution time (ms) | Speedup over embedded Sw | Speedup over baseline Matlab
---------------------------------------------|---------------------|--------------------------|-----------------------------
Embedded Sw on MicroBlaze (at 100 MHz)       | 3958.04             | –                        | 0.21
Baseline Matlab on i7 processor (at 3.1 GHz) | 848.331             | 4.67                     | –
HW_v1 (at 100 MHz)                           | 468.557             | 8.45                     | 1.81
HW_v2 (at 100 MHz)                           | 39.774              | 99.51                    | 21.33
The total time taken to execute the baseline Matlab design is also presented in Table 5. The execution time for the Matlab model is likewise measured 10 times, and the average is presented. The baseline design is executed on an Intel i7 processor running at 3.1 GHz on a desktop computer.
From Table 5, considering the total execution time, our embedded hardware version 2 (HW_v2) is almost 100 times faster, and our embedded hardware version 1 (HW_v1) almost 9 times faster, than the equivalent software (Sw) running on the embedded MicroBlaze processor. Furthermore, our HW_v2 is 21 times faster, and our HW_v1 almost 2 times faster, than the baseline Matlab model running on the Intel i7 processor. It should be noted that all our embedded systems designs run at 100 MHz, whereas the Matlab model runs at 3.1 GHz.
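The quoted speedups follow directly from the execution times in Table 5; the check below reproduces them, rounded to two decimals as in the table.

```python
# Execution times from Table 5, in milliseconds.
times_ms = {
    "sw_microblaze": 3958.04,   # embedded Sw on MicroBlaze (100 MHz)
    "matlab_i7": 848.331,       # baseline Matlab on Intel i7 (3.1 GHz)
    "hw_v1": 468.557,           # register-based hardware (100 MHz)
    "hw_v2": 39.774,            # BRAM-based hardware (100 MHz)
}

def speedup(slower, faster):
    """Speedup of `faster` over `slower`, rounded as in Table 5."""
    return round(times_ms[slower] / times_ms[faster], 2)
```

For example, `speedup("sw_microblaze", "hw_v2")` yields the 99.51 ("almost 100 times") figure cited in the text.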
Unlike the embedded hardware and software designs, the Matlab model is designed such that it terminates the execution of stages 2 and 3 once the system meets the fully charged threshold; it then executes only stage 4 for the remainder of the MPC computation. Hence, the total time obtained for the Matlab model (presented in Table 5) is not the time taken to execute the fast-charge MPC for 3600 iterations but much less than that. As a result, it is difficult to make a direct execution-time comparison between the baseline Matlab model and the embedded systems designs. However, as illustrated in Table 5, our embedded hardware designs still achieve better speedup than the Matlab model running on a high-performance processor. With these speedups, our proposed hardware designs should be able to monitor and control multiple battery cells individually.
From the above results and analysis, it is observed that our register-based HW_v1 is much slower than the BRAM-based HW_v2. Typically, register-based designs should provide better computing power than memory-based designs, since the latter incur an execution overhead for reading/writing from/to the on-chip memory; in this case, the read and write operations from/to on-chip memory take one clock cycle each. However, our memory-based HW_v2 design achieves higher speed performance. This is mainly because the experience gained throughout the design and development of HW_v1 enabled us to enhance the efficiency of HW_v2. Furthermore, the speed performance is also aided by the compact nature and area efficiency of the memory-based design, as discussed in the following subsection.
5.2.2 Resource utilization: register-based HW_v1 versus BRAM-based HW_v2
Table 6 Resource utilization: embedded HW_v2 versus embedded HW_v1

Configuration | Number of occupied slices | Number of BRAMs (36E1) | Number of DSP48E1 slices
--------------|---------------------------|------------------------|-------------------------
HW_v1         | 34,315                    | 62                     | 688
HW_v2         | 10,277                    | 35                     | 73
As observed from Table 6, with the BRAM-based HW_v2, we achieve 70% space saving in terms of the total number of occupied slices and 89% space saving in terms of the total number of DSP slices, compared to the register-based HW_v1. Furthermore, we also achieve 44% space saving in terms of the total number of BRAMs with HW_v2 compared to HW_v1, which is unexpected, since one would assume that a BRAM-based design would naturally utilize more BRAMs than a register-based design.
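The space-saving percentages quoted above follow from the Table 6 counts:

```python
# Resource counts from Table 6: (HW_v1, HW_v2) per resource type.
resources = {
    "occupied_slices": (34315, 10277),
    "brams_36e1": (62, 35),
    "dsp48e1": (688, 73),
}

def space_saving_pct(resource):
    """Percentage of HW_v1's usage saved by HW_v2, rounded as in the text."""
    v1, v2 = resources[resource]
    return round(100 * (v1 - v2) / v1)
```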
From the above results and analyses, it is evident that the BRAM-based HW_v2 is significantly more area-efficient than the register-based HW_v1; hence, the former is more suitable for embedded devices due to the stringent area constraints of these devices.
5.2.3 Analysis of iteration time per cycle for BRAM-based HW_v2
We analyze the per-iteration time only for our BRAM-based HW_v2, since this hardware version is superior to HW_v1, the embedded software design, and also the baseline Matlab model, in terms of both speed and area.
In this case, the execution overhead of the augmented model in stage 1 is approximately 24 μs and is considered minimal. This overhead is the time difference between the first iteration (which includes the processing time through stages 1 to 4, as illustrated in Fig. 33b) and the second iteration (which includes the processing time through stages 2 to 4, as illustrated in Fig. 33c). For both the first and second iterations, stage 3 (the HQP) converges in two loops, which takes approximately 20.5 μs. Hence, we can reasonably assume that the difference in execution time between iterations 1 and 2 is the time taken for stage 1 to complete. Stages 2 and 4 require 5.18 μs to process (as in Fig. 33d), leaving the remainder of the time for stage 3. The time to process stage 3 depends on two factors: the number of non-zero λ elements and the number of iterations required for convergence. For our proposed embedded HW_v2 for the fast-charge MPC algorithm, the processing time for stage 3 typically varies from 15.3 to 444.5 μs: the minimum (15.3 μs) corresponds to one λ element and 2 iterations, and the maximum (444.5 μs) to two λ elements and 40 iterations. Assuming that the first iteration does not converge, the worst-case iteration time is 474 μs (i.e., adding 24 μs to 450 μs). In this case, the fast-charge MPC algorithm could execute more than 2100 times within the 1-s sample time, allowing our embedded architecture to control multiple battery cells individually.
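The timing budget above can be checked arithmetically. All figures are taken from the text; this is a back-of-the-envelope consistency check, not a timing model.

```python
# Per-iteration timing figures for HW_v2, in microseconds (from the text).
STAGE1_OVERHEAD = 24.0     # augmented model, first iteration only
STAGES_2_AND_4 = 5.18      # stages 2 and 4 combined
STAGE3_WORST = 444.5       # stage 3 (HQP): two lambda elements, 40 iterations
WORST_ITERATION = 474.0    # worst-case iteration time quoted in the text

# The quoted worst case indeed bounds the sum of the stage worst cases ...
assert STAGE1_OVERHEAD + STAGES_2_AND_4 + STAGE3_WORST <= WORST_ITERATION

# ... and supports "more than 2100" MPC executions per 1-s sample period.
iterations_per_second = int(1_000_000 // WORST_ITERATION)
```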
5.2.4 Analysis of existing works on embedded designs for MPC
In Section 1, we discussed and analyzed the existing research work on embedded architectures for the MPC algorithm. From this investigation, it was evident that similar work does not exist specifically for the fast-charge MPC algorithm; therefore, it was difficult to make a fair comparison between the algorithms. However, we extended our investigation and selected a few existing works with traits slightly closer to our proposed embedded designs. These designs are discussed and analyzed as follows.

A closely related work was presented in [35], which proposed a hardware/software co-design for MPC comprising a microprocessor and a matrix coprocessor. That design utilized a logarithmic number system (LNS) instead of floating point, and Newton's algorithm instead of an HQP as in our design. Unlike our design, in [35], the model parameters were precalculated offline and stored in the microprocessor.

In [8], an MPC-dedicated processor was proposed, which utilized a mix of fixed-point and floating-point numbers. Similar to our design, this design also utilized the HQP technique, but with Laguerre functions. The processor was designed using Matlab and evaluated using Simulink; however, no actual hardware architecture was implemented.

In [16], a fixed-point MPC solution was proposed with two separate QP solvers as user-designed modules: a primal-dual interior-point QP for sparse matrices and a fast-gradient QP for dense matrices. Unlike our design, this design utilized the MicroBlaze processor to handle all communication and control of the two user-designed modules.

Furthermore, most of the existing designs had different control horizons and prediction horizons, which significantly impact the total execution time of the MPC algorithm. Also, all the above designs were implemented on different platforms, affecting the resource utilization. These facts made it difficult to perform a direct comparison between the algorithms in terms of speed and space.
In addition, it is evident that our architectures are the only embedded designs in the published literature that support a nonzero feedthrough term for instantaneous feedback.
6 Conclusions
In this paper, we introduced unique, novel, and efficient embedded hardware and software architectures for the fast-charge model predictive control (MPC) algorithm for battery cell management in electric vehicles. Our embedded hardware and software architectures are generic and parameterized. Hence, without changing the internal architectures, our embedded designs can be utilized for many other control systems applications that employ similar fast-charge MPC algorithms.
Our BRAM-based HW_v2 achieved superior speedup (100 times faster than its software counterpart), and our register-based HW_v1 also achieved substantial speedup (9 times faster than the equivalent software). Furthermore, our BRAM-based HW_v2 achieved significant space saving compared to our register-based HW_v1, in this case 70% in terms of the total number of occupied slices. It is thus important to consider the speed-space trade-offs, especially in embedded devices with their limited hardware footprints. These two unique embedded hardware versions can be used in different scenarios, depending on the requirements of the applications and the available resources of the embedded platforms.
Our novel and unique embedded software architecture is also created to be lean, compact, and simple; thus, it fits into the available program memory (in this case 128 Kb) of the embedded processor without affecting the basic structure and functionality of the algorithm. We could potentially reduce the program memory usage significantly by constraining the flexibility of the embedded software design. This would allow the embedded processor to incorporate other functionality and algorithms, if necessary.
Due to its superior speedup, our embedded hardware architecture, as a single processing unit, could potentially monitor and control multiple battery cells while treating each battery cell individually. Considering a typical battery pack made up of 84 cells, our single embedded hardware processing unit could easily execute the fast-charge MPC algorithm for all 84 cells within the required 1-s sample time, since the worst-case iteration time per cycle is a mere 474 μs. As future work, we plan to investigate how to interface with all or some of the battery cells in a pack at a time, how to share the bus so as to avoid contention issues, and so on. We are also exploring sophisticated power analysis tools, such as Synopsys Power Compiler, to measure the power consumption of our proposed embedded designs, since power consumption is another major concern in embedded devices.
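The 84-cell claim reduces to the same worst-case figure. This rough check ignores any bus-sharing or multiplexing overhead, which is left as future work above.

```python
WORST_ITERATION_US = 474   # worst-case per-cell iteration time (microseconds)
CELLS = 84                 # cells in a typical battery pack

busy_us = WORST_ITERATION_US * CELLS   # time to service every cell once
# Servicing all 84 cells occupies under 4% of the 1,000,000-us sample period.
assert busy_us < 1_000_000
```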
Our proposed embedded architectures (both hardware and software) for the fastcharge MPC can be utilized as a smart sensor at the battery cell level, locally. Monitoring and controlling certain important parameters of the battery cells at the lowest level will indeed ease the computational burden at the system level. This will also reduce the communication overhead between the battery cells and the global control system and will provide more autonomous control to the battery cells. Also as future work, we will be investigating the feasibility and efficiency of utilizing our embedded architectures for the fastcharge MPC for other control systems applications such as unmanned aerial vehicles (UAVs) and autonomous vehicles [36].
Declarations
Acknowledgements
The authors would like to thank Dr. Scott Trimboli, Assistant Professor, in Electrical and Computer Engineering Department at the University of Colorado at Colorado Springs, and his former PhD student, Dr. Marcelo Xavier, for providing access to the baseline Matlab model for the fastcharge MPC algorithm.
Availability of data and materials
All data generated or analyzed during this study are included in this published article.
Authors’ contributions
AM is DP’s PhD student. AM and DP have been conducting this research. Under the guidance of DP, AM has designed, developed, and implemented the embedded hardware and software architectures for the fastcharge MPC algorithm and performed the experiments. With the assistance of AM, DP wrote the paper. Both authors read and approved the final manuscript.
Authors’ information
Anne K. Madsen received her M.Sc. in Electrical Engineering, and B.Sc. in General Engineering from Naval Postgraduate School (Monterey, CA) and US Naval Academy, respectively. Anne is pursuing her Ph.D. and working as a teaching assistant in the Department of Electrical and Computer Engineering, University of Colorado at Colorado Springs. Anne is also an Independent Engineering Consultant to Rim Technologies. She served as an Officer in the US Navy and taught at both the Air Force and Navy Service Academies, in Math and Engineering divisions, respectively. She also worked as an Engineer and Acquisition Specialist for Air Force Space and Missile Command's Ground-Based Space Surveillance Division for 14 years. Her research interests are cyber-physical systems, control theory, and hardware optimization.
Darshika G. Perera received her Ph.D. degree in Electrical and Computer Engineering from University of Victoria (Canada) and M.Sc. and B.Sc. degrees in Electrical Engineering from Royal Institute of Technology (Sweden) and University of Peradeniya (Sri Lanka) respectively. She is an Assistant Professor in the Department of Electrical and Computer Engineering, University of Colorado at Colorado Springs (UCCS), USA, and also an Adjunct Assistant Professor in the Department of Electrical and Computer Engineering, University of Victoria, Canada. Prior to joining UCCS, Darshika worked as the Senior Engineer and Group Leader of Embedded Systems at CMC Microsystems, Canada. Her research interests are reconfigurable computing, mobile and embedded systems, data mining, and digital systems. Darshika received a best paper award at the IEEE 3PGCIC conference in 2011. She serves on organizing and program committees for several IEEE/ACM conferences and workshops and as a reviewer for several IEEE, Springer, and Elsevier journals. She is a member of the IEEE, IEEE CAS and Computer Societies, and IEEE Women in Engineering.
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
1. Brand, C., Cluzel, C., Anable, J. (2017). Modeling the uptake of plug-in vehicles in a heterogeneous car market using a consumer segmentation approach. Transp. Res. A Policy Pract., 97, 121–136. https://doi.org/10.1016/j.tra.2017.01.017
2. Du, J., Wang, Y., Tripathi, A., Lam, J. (2016). Li-ion battery cell equalization by modules with chain structure switched capacitors, 2016 Asian Conference on Energy, Power and Transportation Electrification (ACEPT) (pp. 1–6). Singapore.
3. Xavier, MA, & Trimboli, MS. (2015). Lithium-ion battery cell-level control using constrained model predictive control and equivalent circuit models. J. Power Sources, 285, 374–384.
4. Xavier, MA (2013). Lithium-ion battery cell management: a model predictive control approach, University of Colorado Colorado Springs, Department of Electrical and Computer Engineering. Colorado Springs: UCCS.
5. Takacs, G, & Rohal'-Ilkiv, B (2012). Model predictive vibration control. London: Springer.
6. Novak, J, & Chalupa, P. (2014). Implementation aspects of embedded MPC with fast gradient method. Int J Circuits Syst Signal Process, 8, 504–511.
7. Zometa, P, Kogel, M, Faulwasser, T, Findeisen, R (2012). Implementation aspects of model predictive control for embedded systems, American Control Conference, 2012 (pp. 1205–1210). Montreal: IEEE.
8. Chen, X, & Wu, X (2011). Design and implementation of model predictive control algorithms for small satellite three-axis stabilization, Proceedings of the IEEE International Conference on Information and Automation. Shenzhen: IEEE.
9. Bleris, LG, Vouzis, PD, Arnold, MG, Kothare, MV (2006). A coprocessor FPGA platform for the implementation of real-time model predictive control, American Control Conference (pp. 1912–1917). Minneapolis: IEEE.
10. Abdolhosseini, M, Zhang, YM, Rabbath, CA (2012). Trajectory tracking with model predictive control for an unmanned quadrotor helicopter: theory and flight test results. In CY Su, S Rakheja, H Liu (Eds.), Intelligent robotics and applications: lecture notes in computer science (pp. 411–420). Berlin: Springer.
11. Ling, KV, Wu, BF, Maciejowski, JM (2008). Embedded model predictive control (MPC) using a FPGA, Proceedings of the 17th World Congress, The International Federation of Automatic Control (pp. 15250–15255). Seoul: IFAC.
12. Jerez, JL, Constantinides, GA, Kerrigan, EC (2011). An FPGA implementation of a sparse quadratic programming solver for constrained model predictive control, FPGA '11 (pp. 209–218). Monterey: ACM.
13. Abbes, AK, Bouani, F, Ksouri, M. (2011). A microcontroller implementation of constrained model predictive control. Int J Electr Electron Eng, 5(3), 199–206.
14. Chui, CK, Nguyen, BP, Ho, Y, Wu, Z, Nguyen, M, Hong, GS, et al. (2013). Embedded real-time model predictive control for glucose regulation, World Congress on Medical Physics and Biomedical Engineering May 26–31, 2012 (pp. 1437–1440). Beijing: Springer.
15. Nguyen, BP, Ho, Y, Wu, Z, Chui, CK (2012). Implementation of model predictive control with modified minimal model on low-power RISC microcontrollers, Proceedings of the Third Symposium on Information and Communication Technology (pp. 165–171). Ha Long: ACM.
16. Hartley, EN, Jerez, JL, Suardi, A, Maciejowski, JM, Kerrigan, EC, Constantinides, GA. (2014). Predictive control using an FPGA with application to aircraft control. IEEE Trans. Control Syst. Technol., 22(3), 1006–1017.
17. Bleris, LG, & Kothare, MV (2005). Real-time implementation of model predictive control, American Control Conference (pp. 4166–4171). Portland: IEEE.
18. Bleris, LG, Kothare, MV, Garcia, J, Arnold, MG (2004). Embedded model predictive control for system-on-a-chip applications.
19. Chen, X, & Wu, X (2012). Implementation and experimental validation of classic MPC on programmable logic controllers, 2012 20th Mediterranean Conference on Control & Automation (MED) (pp. 679–684). Barcelona: IEEE.
20. Ekaputri, C, & Syaichu-Rohman, A (2012). Implementation of model predictive control (MPC) algorithm for inverted pendulum, 2012 IEEE Control and System Graduate Research Colloquium (pp. 116–122). Shah Alam, Selangor: IEEE.
21. Wang, Y, & Boyd, S. (2010). Fast model predictive control using online optimization. IEEE Trans. Control Syst. Technol., 18(2), 267–278.
22. Aridhi, E, Abbes, M, Mami, A (2012). FPGA implementation of predictive control, 2012 16th IEEE Mediterranean Electrotechnical Conference (MELECON) (pp. 191–196). Yasmine Hammamet: IEEE.
23. Martínez-Rodríguez, MC, Brox, P, Tena, E, Acosta, AJ, Baturone, I (2015). Programmable ASICs for model predictive control, 2015 IEEE International Conference on Industrial Technology (ICIT) (pp. 1593–1598). Seville: IEEE.
24. Huyck, B, Ferreau, HJ, Diehl, M, De Brabanter, J, Van Impe, JF, De Moor, B, et al. (2012). Towards online model predictive control on a programmable logic controller: practical considerations. Mathematical Problems in Engineering, 2012, 20 pages.
25. Lima, DM, Americano da Costa, MV, Normey-Rico, JE (2013). A flexible low cost embedded system for model predictive control of industrial processes, 2013 European Control Conference (ECC) (pp. 1571–1576). Zurich: IEEE.
26. Xavier, MA (2016). Efficient strategies for predictive cell-level control of lithium-ion batteries, University of Colorado Colorado Springs, Department of Electrical and Computer Engineering. Colorado Springs: UCCS.
27. Wang, L (2009). Model predictive control system design and implementation using Matlab. London: Springer-Verlag.
28. Holkar, KS, & Waghmare, LM. (2010). An overview of model predictive control. Int J Control Autom Syst, 3(4), 47–63.
29. Ordys, AW, & Pike, AW (1998). State space generalized predictive control incorporating direct through terms, Proceedings of the 37th IEEE Conference on Decision and Control (pp. 4740–4741). Tampa: IEEE.
30. Xilinx, Inc. (2011). "ML605 Hardware User Guide", UG534 (v1.5). www.xilinx.com/support/documentation/boards_and_kits/ug534.pdf
31. Xilinx, Inc. (2011). "LogiCORE IP Floating-Point Operator", DS335 (v5.0). http://www.xilinx.com/support/documentation/ip_documentation/floating_point_ds335.pdf
32. Xilinx, Inc. (2012). "LogiCORE IP AXI Interconnect", DS768 (v1.06.a). http://www.xilinx.com/support/documentation/ip_documentation/axi_interconnect/v1_06_a/ds768_axi_interconnect.pdf
33. Xilinx, Inc. (2012). "LogiCORE IP AXI Timer", DS764 (v1.03.a). http://www.xilinx.com/support/documentation/ip_documentation/axi_timer/v1_03_a/axi_timer_ds764.pdf
34. Perera, DG, & Li, KF (2008). Parallel computation of similarity measures using an FPGA-based processor array, Proceedings of the 22nd IEEE International Conference on Advanced Information Networking and Applications (AINA'08) (pp. 955–962).
35. Vouzis, PD, Bleris, LG, Arnold, MG, Kothare, MV. (2009). A system-on-a-chip implementation for embedded real-time model predictive control. IEEE Trans. Control Syst. Technol., 17(5), 1006–1017.
36. Chen, B, Yang, Z, Huang, S, Du, X, Cui, Z, Bhimani, J, Xie, X, Mi, N (2017). Cyber-physical system enabled nearby traffic flow modelling for autonomous vehicles, 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC) (pp. 1–6).