A methodology for hand and finger motion analysis using adaptive probabilistic models

A methodology for motion analysis and hand tracking based on adaptive probabilistic models is presented. It integrates a deterministic clustering framework with a particle filter and runs in real time. The skin color of the human hand is first segmented. A Bayesian classifier and an adaptive process are used to determine skin color probabilities, which enables the method to cope with luminance changes. We then determine the fingertip probabilities by fitting semicircle models to the curved fingertips. Following this, the deterministic clustering algorithm searches for regions of interest, and a sequential Monte Carlo method tracks the fingertips efficiently. Representative experimental results are included to demonstrate the workability of the proposed framework. Several issues concerning the use of the presented method in embedded systems are discussed. The method can be used to further develop associated applications in embedded robotics and virtual reality.


Introduction
Recently, embedded systems have been applied beneficially in many autonomous and intelligent robotic fields. One of the keys is to enable an embedded robot to see and understand its surroundings automatically. Many embedded systems therefore use vision-based methods, whose algorithms are embedded in robots in both hardware and software, including methods for hand motion analysis. If embedded robotic systems can recognize human body parts automatically, they can be applied to various real-life applications. One example is the embedded robots used and researched after 9/11, which are designed to operate automatically and rescue humans in challenging environments by recognizing human body parts without relying on human eyes. It has thus become very important in recent years to design embedded robots that can recognize and analyze the motion of human body parts. For this reason, research on hand motion recognition based on digital image processing is becoming popular for embedded systems, and computer vision has been applied in many recent applications to assist human motion tracking, especially fingertip tracking. Several fingertip tracking methods have been presented previously. For example, correlation with pre-defined templates was presented in [1], and a chromatic distance was discussed in [2]. Mackie and McCane [3] proposed image-division-based decision tree recognition. However, these methods are not directly applicable to fingertip tracking under self-occlusion. Moreover, the background they used is usually uniform. Locating the fingertip positions correctly is more complicated under self-occlusion and against a non-uniform background. The proposed methodology for tracking the hand and fingertips addresses these issues.
To begin with, the hand is segmented from the background in each frame using an adaptive color detection algorithm. A Bayesian classifier is trained during the off-line phase [4]. An adaptive algorithm for determining skin probability is then applied to refine the classifier so that the system is trained robustly [5]. Following this, we determine fingertip probabilities by fitting semicircle-shaped models to the fingertips [6]. After superimposing the models on every candidate in the test image, we normalize the results, which then serve as the fingertip probability map for tracking. Next, a clustering approach [7] is used to determine regions of interest (ROIs), and a sequential Monte Carlo method [8] is used for tracking by distributing the particles inside the corresponding ROIs. This vision-based methodology enables us to track the fingertips even when some fingers are not fully stretched out or when the luminance changes.
This paper is structured as follows. Previous and conventional work is reviewed in Section 2. The series of steps for hand and finger motion analysis using adaptive probabilistic models is described in Section 3. Section 4 then provides the experimental setup, together with the results and discussion. Finally, Section 5 summarizes the paper and discusses possible embedded robotic applications of the proposed vision-based method.

Related work
Previous work on gesture recognition has been shown to be useful for various applications. Martínez et al. [9] developed a sign language system that recognizes motion primitives and full sentences. They assume that the same sign has different meanings depending on context. Matilainen et al. [10] presented a finger tracking system using template matching for gesture recognition, focusing on mobile devices. Krejov and Bowden [11] presented a system that uses a weighted graph and depth information of the hand to determine the geodesic maxima of the surface. Kereliuk et al. [12] detected the positions of fingertips, using the circular Hough transform to locate the tips of the fingers.
Nevertheless, these gesture recognition methods are not well suited to fingertip tracking when self-occlusion occurs. In addition, the hand and wrist localization in [11] is not particularly smooth or robust, and in our experiments, using the Hough transform to detect the fingertips as in [12] was not robust enough. This is because fingertip edges cannot be easily detected due to the noise around the fingertips. These methods also did not aim to deal with luminance changes in the online process.
We overcome these problems by segmenting the skin color of the hand robustly. To do so, it is important to address the problem of controlling the lighting [13]. The light levels in the off-line and online phases matter for obtaining a correct registration. A major decision has to be made when deriving a color model. Simply setting a threshold in the color model rarely gives accurate and robust results. Another method [14] is to use histogram models. Still, it cannot adapt when the light levels in the off-line and online phases are totally different.
To solve this issue, a Bayesian classifier is utilized. The first advantage of this method is that the system can learn the probabilities automatically and adaptively during the online phase. Starting from a small amount of training data, the probability is adapted during the online phase and converges automatically to a proper value. This allows us to segment the regions we need robustly even when the luminance changes.

Methods
The schematic of the implementation is explained in this section. After capturing the images, a Bayesian classifier is applied adaptively to segment the human hand. As the next step, we apply a matching algorithm to determine the probabilities of the fingertips (i.e., the fingertip probability map). Then, we extend the standard particle filter by using the clustering algorithm to create ROIs for tracking. In this way, the positions of the human hand and fingertips can be tracked visually.

Hand region segmentation
If the projection matrix is known, we can calculate the homography for warping a pre-captured known background. However, the background we used is sometimes dynamic, and the background for that area cannot be easily synthesized. Luminance changes also cause problems when using a pre-captured known background image: the pixel colors of the pre-captured background and those of the current input can be very different.
In our approach, we want to segment the hand from the input image. We build a color model of the hand image. During the learning phase, the color model is also adapted according to the changing luminance. In other words, we assume that the hand is a foreground with a known color model.
To begin with, we calculate the probability of each color being a skin color by applying [4], which is composed of two main phases: an off-line phase and an online phase. First, we manually select some images to train the system. Second, the probability is updated automatically and adaptively from the new input images [5]. In our implementation, the adapting process is set to be automatically disabled as soon as the probabilities are stable. Hence, when online skin color adaptation starts, we assume that there is enough skin in the image. As soon as the online adaptation is sufficient (i.e., the skin color probability has converged to a proper value), we manually stop the adapting process. In this way, once the online learning process has finished, the skin color probability is not affected even if the skin area disappears from the scene.

Off-line phase
Wei et al. [15] suggested that a skin color model based on the YUV color space, used for object segmentation and classification with a 3D range camera, can cover human skin across many races. Similarly, we adopt their assumption of a 3D color representation (YUV). Nevertheless, we use only the U and V components, as this demands less memory. Disregarding the luminance value has also been shown to be useful in detection and tracking for color night vision [16]. During the off-line phase, Bayes' rule is used to estimate the probability P(s|c) that a color c is a skin color:

$$P(s \mid c) = \frac{P(c \mid s)\,P(s)}{P(c)}$$

where P(s) is the proportion of skin-colored pixels in the training images to the total number of pixels in all training images, P(c) is the proportion of the number of occurrences of each color c to the total number of image points during training, and P(c|s) is the proportion of the number of occurrences of a color c within the skin-colored regions to the number of skin-colored image points during training. After that, we use depth-first search (DFS) to assign different labels to the pixels of different regions. Size-based filtering of the found regions is used to remove noise: connected components smaller than a threshold size are assumed to be noise and rejected from further consideration. The size threshold we used is 500 pixels. Note that the intrinsic and extrinsic camera parameters are not needed in this step, since any region smaller than the threshold is simply eliminated.
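As an illustration of the off-line phase, the following minimal sketch estimates P(s|c) from labelled training pixels and then removes small connected components with a depth-first search. It is a sketch under stated assumptions, not the authors' implementation: the function names, the dictionary-based color table, and the use of 4-connectivity are all hypothetical choices; only the Bayes' rule terms and the 500-pixel threshold come from the text.

```python
from collections import Counter


def train_skin_model(pixels):
    """Estimate P(s|c) for every (u, v) colour seen in training.

    pixels: iterable of ((u, v), is_skin) pairs from hand-labelled images.
    Returns a dict mapping each colour c to P(s|c) = P(c|s) P(s) / P(c).
    """
    total = 0
    skin_total = 0
    color_counts = Counter()       # occurrences of each colour c
    skin_color_counts = Counter()  # occurrences of c inside skin regions
    for color, is_skin in pixels:
        total += 1
        color_counts[color] += 1
        if is_skin:
            skin_total += 1
            skin_color_counts[color] += 1

    p_s = skin_total / total
    model = {}
    for color, n_c in color_counts.items():
        p_c = n_c / total
        p_c_given_s = skin_color_counts[color] / skin_total if skin_total else 0.0
        model[color] = p_c_given_s * p_s / p_c
    return model


def remove_small_regions(mask, min_size=500):
    """Clear connected components smaller than min_size pixels.

    mask: 2D list of 0/1 values. 4-connectivity is an assumption of this
    sketch; the text only states that DFS labelling is used.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    out = [row[:] for row in mask]
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Iterative depth-first search over one connected component.
            stack, region = [(y0, x0)], []
            seen[y0][x0] = True
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(region) < min_size:
                for y, x in region:
                    out[y][x] = 0  # treated as noise
    return out
```

Because the three proportions share the same denominators, P(s|c) reduces to the fraction of training pixels of color c that were labelled as skin, which is why a simple lookup table is sufficient at run time.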

Online phase
This phase is similar to the off-line phase. We recalculate the probabilities, but using the values from the new input images. During the online phase, the adapted probabilities are updated with a rule in which P_A(s|c) is the adapted probability of a color c being a skin color, γ is a sensitivity parameter, and W is the number of history frames. If W is set too high, the history will be too long; if W is set too low, the history available for adaptation will be too short. Figure 1 shows an example of robust skin segmentation by adaptive learning. With this adaptive framework, the method copes well with pronounced luminance changes.
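The update equation itself is not reproduced in this text. One form that is consistent with the symbols P_A(s|c), γ, and W, and that is assumed here rather than taken from the source, is a convex blend P_A(s|c) = γ P(s|c) + (1 − γ) P_W(s|c), where P_W(s|c) is a hypothetical estimate computed over the last W frames. The sketch below implements that assumed rule; the class name and the default values of `gamma` and `window` are illustrative only.

```python
from collections import Counter, deque


class AdaptiveSkinModel:
    """Blend an off-line skin model with statistics from the last W frames.

    Assumed rule: P_A(s|c) = gamma * P(s|c) + (1 - gamma) * P_W(s|c).
    """

    def __init__(self, offline_model, gamma=0.8, window=5):
        self.offline = offline_model          # dict: colour -> P(s|c) from training
        self.gamma = gamma                    # sensitivity parameter
        self.history = deque(maxlen=window)   # keeps only the last W frames

    def update(self, frame_pixels):
        """frame_pixels: iterable of ((u, v), is_skin) from the current segmentation."""
        colors, skin = Counter(), Counter()
        for color, is_skin in frame_pixels:
            colors[color] += 1
            if is_skin:
                skin[color] += 1
        self.history.append((skin, colors))

    def probability(self, color):
        """Return the adapted probability P_A(s|c) for one colour."""
        n_c = sum(colors[color] for _, colors in self.history)
        n_cs = sum(skin[color] for skin, _ in self.history)
        p_offline = self.offline.get(color, 0.0)
        if n_c == 0:
            return p_offline  # colour not seen recently: keep the trained value
        p_recent = n_cs / n_c
        return self.gamma * p_offline + (1.0 - self.gamma) * p_recent
```

The bounded `deque` makes the trade-off described above concrete: a large window reacts slowly to a lighting change, while a small one forgets earlier evidence after only a few frames.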

Determining the probabilities of tips
After segmenting the hand region, we fit semicircle models to the curved fingertips [6]. Six models are used to handle different sizes and orientations of the fingertips. We match the semicircle templates against the hand segmentation result using a matching score in which T(x, y) is a searched template at coordinates (x, y) and H(x, y) is the hand segmentation result at the current search position. Following this, we sum the results of the fingertip models:

$$R_{\mathrm{sum}}(x, y) = \sum_{i=1}^{N_0} R_i(x, y)$$

where $R_i(x, y)$ denotes the matching result of the i-th fingertip model and $N_0$ is the number of fingertip models.
Our experimental results have revealed that using the sum of the matches of all fingertip models gives a better result than other combinations (such as using the maximum of the matches). A possible reason is that every model contributes, so the information from all matches is used. Even if no individual match is close to the true position, the sum can still give a promising result as long as the mean of the matches is close to it. Using the maximum of the matches would give a good result if the matches were very scattered, but in our experiments this case rarely occurs when matching the models for fingertip tracking. The models are then superimposed on every candidate during testing. Next, we normalize the summed result:

$$R_{\mathrm{normalized}}(x, y) = \frac{R_{\mathrm{sum}}(x, y)}{R_{\mathrm{sum}}(x_{\max}, y_{\max})}$$

where $(x_{\max}, y_{\max})$ satisfies $\forall (x, y):\ R_{\mathrm{sum}}(x_{\max}, y_{\max}) \ge R_{\mathrm{sum}}(x, y)$. As a result, the fingertip probability of each pixel is obtained.
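The construction of the fingertip probability map can be sketched as follows. The exact matching score is not given in this text, so the sketch substitutes a simple stand-in (the fraction of pixels on which a binary template agrees with the binary hand mask); that score, the function names, and the requirement that all templates share one size are assumptions made for illustration. The summation over models and the normalization by the maximum follow the description above.

```python
def match_template(hand, template):
    """Slide a binary template over a binary hand mask.

    Returns a 2D list indexed by the template's top-left corner; each entry
    is the fraction of template pixels that agree with the mask (a stand-in
    score, not the score used in the paper).
    """
    rows, cols = len(hand), len(hand[0])
    t_rows, t_cols = len(template), len(template[0])
    scores = [[0.0] * (cols - t_cols + 1) for _ in range(rows - t_rows + 1)]
    for y in range(rows - t_rows + 1):
        for x in range(cols - t_cols + 1):
            agree = sum(
                1
                for j in range(t_rows)
                for i in range(t_cols)
                if hand[y + j][x + i] == template[j][i]
            )
            scores[y][x] = agree / (t_rows * t_cols)
    return scores


def fingertip_probability_map(hand, templates):
    """Sum the match maps of all templates, then divide by the global maximum."""
    maps = [match_template(hand, t) for t in templates]  # templates share one size
    rows, cols = len(maps[0]), len(maps[0][0])
    r_sum = [[sum(m[y][x] for m in maps) for x in range(cols)] for y in range(rows)]
    r_max = max(max(row) for row in r_sum)
    if r_max == 0:
        return r_sum  # nothing matched anywhere; avoid dividing by zero
    return [[value / r_max for value in row] for row in r_sum]
```

After normalization the best-matching position always has probability 1, so the map can be thresholded or rendered directly as a gray scale image.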

Multiple fingertip tracking
Our method takes advantage of two properties of sequential Monte Carlo [8]: automatic track initialization and recovery whenever tracking fails. When the fingertips disappear from the scene and then reappear, we can still track them correctly thanks to the particle filter. However, direct application of the sequential Monte Carlo method to multiple object tracking is not feasible because it does not define an obvious way to identify individual hypotheses.
In our previous work [17], we used differently colored fingertip markers for tracking. With colored markers, it is easy to use the standard particle filter to track each marker separately (because the colors differ). In markerless tracking, however, particles are not distributed to each fingertip consistently, since every fingertip represents the same hypothesis. To solve this problem, we extend the standard particle filter by applying a deterministic clustering approach as proposed in [7]. We create a rectangular ROI around each fingertip and then distribute the particles only inside the corresponding ROI (whereas the standard particle filter would distribute particles all over the image).
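The idea of confining particles to their ROIs can be sketched in a few lines. This is an illustrative sketch only: the function name, the `(x0, y0, x1, y1)` rectangle format, uniform sampling inside each rectangle, and the per-ROI particle count are assumptions, not details given in the text.

```python
import random


def scatter_particles(rois, particles_per_roi, rng=None):
    """Draw particles uniformly inside each rectangular ROI.

    rois: list of (x0, y0, x1, y1) rectangles, one per fingertip candidate.
    Returns a dict mapping the ROI index to its list of (x, y) particles, so
    each fingertip keeps its own particle set instead of sharing one.
    """
    rng = rng or random.Random()
    particles = {}
    for index, (x0, y0, x1, y1) in enumerate(rois):
        particles[index] = [
            (rng.uniform(x0, x1), rng.uniform(y0, y1))
            for _ in range(particles_per_roi)
        ]
    return particles
```

Keeping one particle set per ROI is what lets otherwise identical fingertip hypotheses be told apart: each set can only follow the target inside its own rectangle.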

Clustering algorithm
As explained in [7], the idea of clustering is to create rectangular ROIs by using a buffer to determine whether the contours found in the fingertip probability map, i.e., $R_{\mathrm{normalized}}(x, y)$, are consistent enough. The intensity in the gray scale image represents the fingertip probabilities (higher brightness means higher probability). After computing the gray scale image of the fingertip probability map, we extract contours using the FindContours function implemented in the Intel OpenCV library. In other words, the contours mark the areas of high fingertip probability. Every contour found is stored in a measurement vector $Y_t$.
Given a set of independent contours $Y_t$, we need to find a set of selected regions. Denote the set of selected ROIs by $Z_t = \{Z_t^{(j)},\ j = 1, \ldots, J_t\}$, where $J_t$ is the number of ROIs found at time t within $Y_t$. The j-th region $Z_t^{(j)}$ comprises $P_t^{(j)}$ contours at successive scans in $Y_t$ that are likely to originate from the true targets of interest. The concept is to group contours in $Y_t$ that are in the spatial vicinity of each other at various time steps. If the targets are clearly separated, the contours from each target cluster in the locations that the target has visited from t − τ to t, where τ is the width of the buffer.
Next, we build each region $Z_t^{(j)}$ from a cluster of contours received in $Y_t$ and store it as a set of time and contour indices, i.e., pairs of indices $(t, m_t)$. We denote the m-th contour of $y_{t'+1}$ and the l-th contour of $y_{t'}$ by $y_{m,t'+1}$ and $y_{l,t'}$, respectively. The normalized distance $d_{m,l}(t'+1, t')$ between $y_{m,t'+1}$ and $y_{l,t'}$ is calculated from the intersection area of the two contours. Our assumption is that if the intersection area of two contours is large enough, the two contours should be grouped into the same cluster $Z_t^{(j)}$ (so the normalized distance $d_{m,l}(t'+1, t')$ is set low). The minimum distance between two contours is also determined when calculating the normalized distance. For every contour, the normalized distance is then tested against a given threshold $\eta_0$.
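A simplified version of this grouping step can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: contours are approximated by their bounding rectangles, the normalized distance is taken as one minus the intersection area divided by the smaller rectangle's area, and a contour is linked only to a cluster that was extended in the immediately preceding frame. The caller is expected to pass only the frames inside the buffer (t − τ to t).

```python
def rect_area(rect):
    """Area of an (x0, y0, x1, y1) rectangle; empty rectangles have area 0."""
    x0, y0, x1, y1 = rect
    return max(0, x1 - x0) * max(0, y1 - y0)


def normalized_distance(a, b):
    """Assumed distance: 0 when one rectangle contains the other, 1 when disjoint."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    smaller = min(rect_area(a), rect_area(b))
    if smaller == 0:
        return 1.0
    return 1.0 - rect_area(overlap) / smaller


def cluster_contours(frames, eta0=0.5):
    """Group contours across consecutive frames.

    frames: list over time of lists of rectangles (the buffered measurements).
    Returns clusters as lists of (t, m) index pairs, mirroring the
    time/contour index pairs described in the text.
    """
    clusters = []
    for t, contours in enumerate(frames):
        for m, rect in enumerate(contours):
            for cluster in clusters:
                last_t, last_m = cluster[-1]
                # Link only to a cluster whose latest contour is from frame t - 1.
                if (last_t == t - 1
                        and normalized_distance(rect, frames[last_t][last_m]) < eta0):
                    cluster.append((t, m))
                    break
            else:
                clusters.append([(t, m)])  # no match: start a new cluster
    return clusters
```

A rectangular ROI for each cluster could then be taken as the bounding box of its member rectangles; short-lived clusters are natural candidates for the noise class discussed next.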
After the ROIs are detected, they are classified as either noise or genuine ROIs. Genuine ROIs may be either active or inactive, so we determine this as well. To decide, we find the relationship between the active tracks and the regions of interest. By continuously finding this association and determining the appearance and disappearance of regions, the system can recognize the number of tracking targets (in this case, fingertips) at different stages. Figure 2 depicts an example of hand gesture and fingertip recognition using the deterministic clustering algorithm.

Sequential Monte Carlo
There are two possible ways to use the skin color probability in the particle filter step. The first is to use the skin color probability itself. The second is to threshold it first and then use the binarized image. In our implementation, we use the second way. Each sample is propagated from the set $s'_{t-1}$ according to

$$s_t^{(n)} = g\left(s_t'^{(n)}\right) + E$$

where E is Gaussian noise and $g(s_t'^{(n)})$ is a propagation function. We rely only on the noise term for propagation, i.e., g(x) = x. Figure 3 presents an example of finger tracking using the extended particle filter. After that, weights are generated by using the probabilities of fingertips from Equation 4, with $p(X_t)$ denoting the probability density function. The sample set representation $\{(s_t^{(n)}, \pi_t^{(n)})\}$ of the state density at time t is then calculated according to

$$\pi_t^{(n)} = p\left(X_t = s_t^{(n)}\right)$$

where $p(X_t = s_t^{(n)})$ represents the probability that a fingertip is at position $s_t^{(n)}$. As in a standard particle filter, the weights are then normalized so that they sum to one. Moments of the tracked position are calculated at time step t according to

$$\varepsilon\left[f(X_t)\right] = \sum_{n=1}^{N} \pi_t^{(n)} f\left(s_t^{(n)}\right)$$

where N is the number of samples that have been generated and $\varepsilon[f(X_t)]$ is the centroid of the fingertip. This framework enables us to track and recognize the fingertips.
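One cycle of the filter for a single fingertip can be sketched as below, following the sequence described above: resample, propagate with g(x) = x plus Gaussian noise, weight by the fingertip probability, normalize, and take the weighted mean as the centroid. The function name, the noise standard deviation `sigma`, the clamping to the image bounds, and the uniform fallback when all weights are zero are assumptions of this sketch.

```python
import random


def particle_filter_step(samples, weights, prob_map, sigma=1.0, rng=None):
    """One resample / propagate / re-weight cycle for a single fingertip.

    samples  : list of (x, y) particle positions from the previous frame
    weights  : matching list of normalized weights
    prob_map : 2D list, prob_map[y][x] = fingertip probability at that pixel
    Returns (new_samples, new_weights, centroid).
    """
    rng = rng or random.Random()
    n = len(samples)
    height, width = len(prob_map), len(prob_map[0])

    # Resample the previous set in proportion to its weights.
    resampled = rng.choices(samples, weights=weights, k=n)

    new_samples, new_weights = [], []
    for x, y in resampled:
        # Propagation with g(x) = x: the particle only receives Gaussian noise E.
        nx = min(max(x + rng.gauss(0.0, sigma), 0.0), width - 1.0)
        ny = min(max(y + rng.gauss(0.0, sigma), 0.0), height - 1.0)
        new_samples.append((nx, ny))
        # Weight = fingertip probability at the (rounded) particle position.
        new_weights.append(prob_map[int(round(ny))][int(round(nx))])

    total = sum(new_weights)
    if total > 0:
        new_weights = [w / total for w in new_weights]
    else:
        new_weights = [1.0 / n] * n  # no evidence: fall back to uniform weights

    # Weighted mean of the particle positions, i.e. the estimated centroid.
    cx = sum(w * x for w, (x, _) in zip(new_weights, new_samples))
    cy = sum(w * y for w, (_, y) in zip(new_weights, new_samples))
    return new_samples, new_weights, (cx, cy)
```

In the extended filter described earlier, this step would be run once per ROI, with the initial samples drawn inside that ROI.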

Results
Figure 4 shows an example of the online tracking results from a total of 300 frames. The reported experiment was run online on an Intel® Core™ i5-3317U processor at 1.70 GHz. The top-left image shows the input image, which is captured from a camera with 320 × 240 resolution. We capture a scene in which a user shows his hand in front of the camera. The top-right image shows the adaptive hand segmentation. The bottom-left image shows the fingertip probability map; the intensity in this gray scale image represents the fingertip probabilities, with higher brightness meaning higher probability. After the clustering algorithm and the extended particle filter are applied, the tracked fingertips are shown in the bottom-right image. The system uses 300 particles, which our experiments showed to be suitable for the proposed methodology.
First, processing time is an important aspect of many embedded systems, especially when a vision-based method is to be used in them. The computation time for the sequence shown is approximately 12 frames per second without optimization, which we regard as real time. At this speed, it is convenient to implement the proposed method in an embedded systems architecture, especially in embedded robotic systems, because robots using embedded systems usually need image processing algorithms that run in real time or nearly real time. Our measured speed thus indicates that the proposed method can support embedded systems in this respect.
The second issue for embedded systems is power. Any system that requires too much electric power is not feasible for embedded robotic systems. We tested the system on a portable laptop with an Intel® Core™ i5-3317U processor. The system does not need any additional power. In fact, the system can be powered portably from a lightweight tablet requiring only a small amount of power. The laptop battery we used is an eight-cell, 14.8 V, 47 Wh/3,060 mAh lithium-ion battery. It lasts at least 3 h when fully running the proposed tracking algorithm; when the process is not running, it lasts approximately 4 h. Recharging the battery takes only 2 h, which is also convenient for many smart embedded robots. Thus, our experiments indicate that the battery autonomy is practical for embedded applications even when the tracking method is running continuously. This means that, in terms of power, the proposed vision-based method can readily be applied to robots built on battery-powered embedded systems.
Figure 2 The deterministic clustering algorithm is used for hand gesture and fingertip recognition.
At the start of the experiment, a user enters the camera's field of view. He then changes his hand into different poses. In our method, the number of detected ROIs varies with the number of fingertips appearing in the input images (the number of ROIs is found automatically by the algorithm described in the previous section). For example, five fingertips appear in Figure 5, while three and four fingertips appear in Figures 6 and 7, respectively. In each case, the system automatically determines the correct number of visible fingertips. The experiments show that the system can successfully track the fingertip positions even when the luminance changes markedly from the off-line phase.
To evaluate the accuracy of the presented method quantitatively, we select 50 frames from the 300 consecutive frames, as reported in Table 1. The fingertip positions predicted by the proposed tracking method are compared with manually measured ground truth positions. The ground truth is obtained by manually selecting the fingertip positions with mouse clicks, while the tracked fingertip positions are output by our system. We then compute the Euclidean distance errors in pixels on the 320 × 240 image. After obtaining the distance errors in each image, the mean distance error and the standard deviation are calculated. The forefinger shows the largest mean error, 11.23 pixels, compared with the other fingers (5.31 pixels for the little finger, 7.42 pixels for the ring finger, 9.71 pixels for the middle finger, and 8.65 pixels for the thumb). This is because the forefinger usually moves quite fast in this experiment relative to the other fingers. This suggests that the sequential Monte Carlo method we used may not perform perfectly when the tracked objects move too quickly.
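The error statistics for one finger can be computed as in the short sketch below. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the sample or population form was used.

```python
import math
import statistics


def tracking_errors(predicted, actual):
    """Mean and standard deviation of per-frame Euclidean errors, in pixels.

    predicted, actual: equal-length lists of (x, y) fingertip positions, one
    entry per evaluated frame (tracker output and mouse-clicked ground truth).
    """
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return statistics.mean(distances), statistics.pstdev(distances)
```

For example, a tracker that is exact in one frame and off by (3, 4) pixels in another has errors of 0 and 5 pixels, giving a mean of 2.5 and a population standard deviation of 2.5.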
In this experiment, we use our own input sequence, which differs from the inputs used by previous methods. It is therefore not easy to compare our results directly with those obtained by other algorithms. Even so, although the comparison is not made on the same input sequence, a qualitative comparison indicates that our proposed method outperforms previous methods, such as the results obtained by [6]. Also, although we do not use the same measurement as [3], the numbers in Table 1 indicate that our algorithm quantitatively outperforms the method presented in [3]. We believe the errors presented in Table 1 are small enough to make the proposed framework a suitable methodology for human hand motion recognition and fingertip tracking.

Conclusions
This paper has developed an algorithm that tracks the positions of the hand and fingertips accurately. The skin-colored region of a user is segmented by applying a Bayesian classifier adaptively and automatically. After that, a matching algorithm is used to determine the probabilities of the fingertips based on their primitives. Following this, we extend the particle filter by using a deterministic clustering algorithm for tracking fingertips. The experimental results have shown that the proposed methodology is effective even with non-uniform backgrounds. The behavior of the proposed method in an embedded context, in terms of power, processing time, and the autonomy of battery-operated equipment, has also been discussed. We believe that the proposed system reaches acceptably accurate results. However, we plan to address finger self-occlusion by using multiple cameras, which typically involves more than two cameras for stereo images. As part of our future work, we also intend to use this implementation to further develop associated virtual-reality applications and related embedded robotic systems such as those in [18] and [19].

Figure 1 Skin segmentation by adaptive learning.

Figure 3 An extended particle filter is utilized for finger tracking and recognition.

Figure 4 Representative snapshot at the commencement of the experiment while a user is showing his two fingers.

Figure 5 A user starts to change his hand to show five fingers clearly.

Figure 6 The number of detected ROIs can be varied, matching the number of fingers that appear.

Figure 7 The tracker can automatically determine the correct number of fingertips that appear.

Table 1 Mean error and standard deviation for the five fingertips