# Research of multiple-instance learning for target recognition and tracking

- Jiang Qin

**Received: **2 December 2015

**Accepted: **2 March 2016

**Published: **15 March 2016

## Abstract

Target recognition and tracking is a hot research topic in image and video processing and is widely used in motion analysis, behavior recognition, and related fields. In this paper, we study target recognition and tracking in a sequence of images; our approach is based on the multiple-instance learning technique. We first present a general target tracking framework. Within the proposed framework, we use image frames to generate positive and negative samples, train a classifier on them, and use the classifier to differentiate the target from its background. A set of weak classifiers is combined into a strong classifier. Experiments on two public datasets show that the proposed approach achieves better precision and recall than related works.

## 1 Introduction

Target recognition and tracking is applied in many fields, such as motion analysis [1] and behavior recognition [2]. However, occlusion, cluttered backgrounds, lighting changes, and appearance variations pose great challenges for target recognition and tracking, and can cause target drift or even tracking failure [3]. Appearance model-based tracking algorithms [4, 5] represent targets with scale-invariant feature transform or histogram of oriented gradients features, but these features cannot capture the essential characteristics of targets, and mismatches often occur during tracking. Moreover, complex appearance models lead to very high computational cost.

The combination of appearance models and traditional machine learning techniques casts target tracking as a binary classification problem [6, 7]; this approach can exploit background information effectively and thus improve tracking effectiveness. However, because there is not enough training data for the classification model, the recognition ability is low and misclassification often occurs. Deep learning is an active research area in image and visual processing. By constructing deep non-linear network models [8, 9], the essential features of images can be learned, which improves classification accuracy.

The flock-of-trackers approach [10] combines local trackers with a global motion model and can handle occlusion and local changes of non-rigid targets. The cell flock of trackers [11] tracks targets with a selected optimal local tracker, which mitigates target drift and is more robust in target tracking.

Multiple-instance learning was first proposed by Dietterich et al. [12]; it is often regarded as a fourth machine learning paradigm besides supervised learning, unsupervised learning, and reinforcement learning. Zhang et al. [13] embed multiple-instance learning into the AnyBoost algorithm framework and construct the MILBoost classifier for target detection. Babenko et al. [14] use multiple-instance learning for target tracking with good tracking effectiveness, and multiple-instance learning has since become an active research direction in target tracking. Zeisl et al. [15] apply semi-supervised multiple-instance learning to target tracking: the target and background of the first frame are treated as tagged samples, and targets of the subsequent frames are treated as untagged samples. The tagged samples from the first frame and the correctly tracked untagged samples then serve as priors for the following frame, which improves the stability of target tracking [16]. In addition, Babenko et al. [17] analyzed visual tracking with online multiple-instance learning, but they aim to track a predefined target, whereas our method can recognize any target against its background.

However, the original multiple-instance learning formulation suffers from low classification effectiveness and poor real-time performance. To address these weaknesses, we propose a new weak classifier that assigns different weights to different positive samples and to different weak classifiers. In addition, we propose a strong classifier to improve the accuracy and real-time performance of target tracking.

The rest of the paper is organized as follows. In Section 2, we present our proposed target tracking algorithm based on multiple-instance learning. Experiments and conclusions are given in Sections 3 and 4, respectively.

## 2 Multiple-instance learning target tracking algorithm

We train a classifier on the samples collected from the frames processed so far and use it to locate the target in the (*t* + 1)-th frame; once the (*t* + 1)-th frame is classified, we add it into the training data for future prediction. The classifier evolves as time goes on.
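The frame-by-frame loop above can be sketched as follows. This is a toy stand-in, not the paper's classifier: a "frame" is a 1-D array, the "target" is a bright pixel, and the "model" is just the running mean of past target intensities; `extract_sample`, `predict`, and `track` are illustrative names.

```python
import numpy as np

def extract_sample(frame, loc):
    """Collect the intensity at the tracked location as a training sample."""
    return frame[loc]

def predict(model, frame, prev_loc, search_radius=3):
    """Examine candidate locations within the search radius and pick the
    one whose intensity best matches the learned target model."""
    lo = max(0, prev_loc - search_radius)
    hi = min(len(frame), prev_loc + search_radius + 1)
    return min(range(lo, hi), key=lambda i: abs(frame[i] - model))

def track(frames, initial_loc):
    """Classify each new frame, then fold it into the training data."""
    loc = initial_loc
    samples = [extract_sample(frames[0], loc)]
    for frame in frames[1:]:
        model = np.mean(samples)          # "classifier" trained on past frames
        loc = predict(model, frame, loc)  # classify the (t+1)-th frame
        samples.append(extract_sample(frame, loc))  # frame joins training data
    return loc
```

The point of the sketch is the update order: predict first with the current model, then absorb the newly classified frame so the model evolves over time.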

### 2.1 Selection of positive and negative samples

Denote a sample as *X*. Let the location of a sample be *l*_{ t } at time *t*; the category of a sample is *y* ∈ {0, 1}, where *y* = 1 if *X* is the target and *y* = 0 if *X* is the background. Let the location of the target be \( {l}_{t-1}^{*} \) at time *t* − 1; then the sample set awaiting classification at time *t* is

\( {X}^s=\left\{X:\left\Vert l(X)-{l}_{t-1}^{*}\right\Vert <s\right\} \)

where *l*(*X*) is the location of sample *X* and *s* is the search radius.

At time *t*, we compute the probability *p*(*y* = 1) that each sample *X* in this set is a positive sample. The probability that the target occurs within the circular region of radius *s* is assumed uniform.

The positive sample set *X*^{+} contains *N* samples taken from a circle with \( {l}_t^{*} \) as its center and radius *α*, that is

\( {X}^{+}=\left\{X:\left\Vert l(X)-{l}_t^{*}\right\Vert <\alpha \right\} \)

The negative sample set *X*^{−} contains *L* samples taken from an annulus with \( {l}_t^{*} \) as its center and radii from *β* to *γ*, that is

\( {X}^{-}=\left\{X:\beta <\left\Vert l(X)-{l}_t^{*}\right\Vert <\gamma \right\} \)

### 2.2 Training a classifier

The classifier is trained with *X*^{+} and *X*^{−}; the probability that a sample is a positive sample is then [14]:

\( p\left(y=1\Big|X\right)=\frac{1}{2}\left(1+ \tanh \left(H(X)\right)\right) \)

where \( \tanh \left(H(X)\right)=\frac{e^{H(X)}-{e}^{-H(X)}}{e^{H(X)}+{e}^{-H(X)}} \) and *H*(*X*) is a strong classifier of the samples consisting of *K* weak classifiers.
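One plausible reading of this tanh-based posterior, assumed here since the original equation was lost in extraction, rescales tanh(*H*(*X*)) from [−1, 1] to [0, 1]; algebraically this equals the logistic function applied to 2*H*(*X*):

```python
import math

def posterior(h_score):
    """p(y=1|X) from the strong-classifier score H(X), via tanh rescaled
    to [0, 1]; score 0 maps to probability 0.5 (maximum uncertainty)."""
    return 0.5 * (1.0 + math.tanh(h_score))
```

Any monotone squashing of *H*(*X*) into [0, 1] would serve; the tanh form is used only because the paper defines tanh explicitly at this point.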

*H*(*X*) is given by

\( H(X)={\displaystyle {\sum}_{k=1}^K{\lambda}_k{h}_k(X)} \)

where *h*_{ k }(*X*) is the *k*-th weak classifier and *λ*_{ k } is its weight. The weak classifiers are selected according to their classification ability: a weak classifier that classifies well receives a large weight; otherwise, it receives a small weight. Let \( {\lambda}_k={e}^{\frac{1-k}{K}} \); the weak classifiers are then selected from the candidate set *Φ* = {*h*_{1}, …, *h*_{ M }}, where *M* > *K*. The candidate set is generated with the following method: let \( {h}_k= \log \left(\frac{p\left(y=1\Big|{f}_k(X)\right)}{p\left(y=0\Big|{f}_k(X)\right)}\right) \), where *f*_{ k }(*X*) is a Haar-like feature [18]. Letting *p*(*y* = 0) = *p*(*y* = 1), the Bayes rule gives \( {h}_k= \log \left(\frac{p\left({f}_k(X)\Big|y=1\right)}{p\left({f}_k(X)\Big|y=0\right)}\right) \), where *p*(*f*_{ k }(*X*)|*y* = 1) and *p*(*f*_{ k }(*X*)|*y* = 0) conform to Gaussian distributions [19], that is

\( p\left({f}_k(X)\Big|y=1\right)\sim N\left({\mu}_1,{\sigma}_1\right),\kern1em p\left({f}_k(X)\Big|y=0\right)\sim N\left({\mu}_0,{\sigma}_0\right) \)

where *μ*_{1}, *σ*_{1}, *μ*_{0}, and *σ*_{0} are the expectations and variances of the two Gaussian distributions.

The update rules for *μ*_{ i } and *σ*_{ i } are

\( {\mu}_i\leftarrow \eta {\mu}_i+\left(1-\eta \right)\overline{\mu_i} \)

\( {\sigma}_i\leftarrow \sqrt{\eta {\sigma}_i^2+\left(1-\eta \right)\overline{\sigma_i^2}} \)

where *i* = 0, 1, *η* is the learning coefficient, and \( \overline{\mu_i} \) and \( \overline{\sigma_i^2} \) are the sample mean and variance of *f*_{ k } over the newly collected samples of class *i*.
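A minimal sketch of the online Gaussian update and the resulting log-odds weak classifier, assuming the standard exponential-forgetting form of the update (the paper's exact equations were lost in extraction); `update_gaussian` and `weak_log_odds` are illustrative names:

```python
import numpy as np

def update_gaussian(mu, sigma, feats, eta):
    """Online update of one class-conditional Gaussian from new feature
    responses `feats`; eta is the learning coefficient (eta = 1 keeps
    the old model unchanged, eta = 0 replaces it with the new batch)."""
    mu_new = eta * mu + (1.0 - eta) * np.mean(feats)
    var_new = eta * sigma ** 2 + (1.0 - eta) * np.var(feats)
    return mu_new, np.sqrt(var_new)

def weak_log_odds(f, mu1, s1, mu0, s0):
    """h_k(X) = log p(f_k(X)|y=1) - log p(f_k(X)|y=0) under the
    two Gaussians, assuming equal priors p(y=0) = p(y=1)."""
    def log_pdf(x, mu, s):
        return -0.5 * np.log(2.0 * np.pi * s ** 2) - (x - mu) ** 2 / (2.0 * s ** 2)
    return log_pdf(f, mu1, s1) - log_pdf(f, mu0, s0)
```

A feature response near *μ*_{1} yields a positive log-odds (vote for the target), one near *μ*_{0} a negative log-odds, which is exactly the sign behavior the Bayes-rule derivation above requires.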

### 2.3 Selecting weak classifiers

The strong classifier is built by selecting, from the candidate set *Φ*, *K* weak classifiers; the selection rule should ensure an optimal strong classifier [20]. Babenko et al. [14] propose to choose each weak classifier *h* by maximizing the log-likelihood over both positive and negative sample sets, that is

\( {h}_k=\underset{h\in \varPhi }{ \arg \max }L\left({H}_{k-1}+{\lambda}_kh\right) \)  (12)

where *L*(*H*) is computed as

\( L(H)=c{\displaystyle {\sum}_i{w}_i\left({y}_i \log p\left(y=1\Big|{X}_i\right)+\left(1-{y}_i\right) \log \left(1-p\left(y=1\Big|{X}_i\right)\right)\right)} \)  (15)

where *c* is the normalization constant.

In Eq. 15, because the similarities between negative samples are small, we set the weight *w* to a constant.

Selecting *h* with Eq. 12 directly consumes a lot of computing resources, so we use a more efficient approach. Expanding *L*(*H*_{ k − 1} + *λ*_{ k }*h*) with the first-order Taylor formula, we have

\( L\left({H}_{k-1}+{\lambda}_kh\right)\approx L\left({H}_{k-1}\right)+<{\lambda}_kh,\mathit{\nabla}L(H)>\Big|{}_{H={H}_{k-1}} \)

where *y*_{ i } = *i* and *i* = 0, 1. *L*(*H*_{ k − 1}) is already known, so to maximize *L*(*H*_{ k − 1} + *λ*_{ k }*h*) we only need to maximize \( <{\lambda}_kh,\mathit{\nabla}L(H)>\Big|{}_{H={H}_{k-1}} \); Eq. 12 can then be rewritten as

\( {h}_k=\underset{h\in \varPhi }{ \arg \max }<h,\mathit{\nabla}L(H)>\Big|{}_{H={H}_{k-1}} \)

Computing Eq. 12 directly requires evaluating, for each sample, *M* probabilities of belonging to the positive or negative set, so the computational complexity is very high. In this paper, we propose an algorithm for computing \( H(X)={\displaystyle {\sum}_{k=1}^K{\lambda}_k{h}_k(X)} \); the procedure is given in Algorithm 1. From the first frame of a video, we locate the target to be tracked and generate the positive and negative sample sets {*X*^{+}, *X*^{−}}, where *X*^{+} = {*X*_{1j }, *y*_{1} = 1, *j* = 0, 1, …, *N* − 1} and *X*^{−} = {*X*_{0j }, *y*_{0} = 0, *j* = *N*, *N* + 1, …, *N* + *L* − 1}. Next, according to Eqs. 8 and 9, we compute *p*(*f*(*X*_{1j })|*y* = 1) and *p*(*f*(*X*_{0j })|*y* = 0) and then compute *h*_{ k } for *k* from 1 to *M* to generate the weak classifier set *Φ* = {*h*_{1}, …, *h*_{ M }}.
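The greedy construction of *H*(*X*) can be sketched as follows. This sketch uses the exhaustive likelihood maximization of Eq. 12 rather than the Taylor-approximated gradient form, a tanh-based posterior as one reading of the paper's mapping, and synthetic weak-classifier responses; all function names are illustrative.

```python
import numpy as np

def sigmoid_like(H):
    """p(y=1|X) from the strong score, rescaled tanh (assumed form)."""
    return 0.5 * (1.0 + np.tanh(H))

def log_likelihood(H, y, eps=1e-9):
    """Binary log-likelihood of labels y under scores H (weights constant)."""
    p = np.clip(sigmoid_like(H), eps, 1.0 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def select_strong_classifier(h_responses, y, K):
    """Greedily pick K of the M candidate weak classifiers.

    h_responses: (M, n) array, h_responses[m, i] = h_m(X_i)
    y:           (n,) array of 0/1 labels
    Returns the chosen indices and the combined scores
    H(X_i) = sum_k lambda_k * h_k(X_i)."""
    M, n = h_responses.shape
    H = np.zeros(n)
    chosen = []
    for k in range(1, K + 1):
        lam = np.exp((1 - k) / K)       # lambda_k = e^{(1-k)/K}
        # pick the candidate whose addition maximizes L(H + lam * h)
        scores = [log_likelihood(H + lam * h_responses[m], y)
                  for m in range(M)]
        best = int(np.argmax(scores))
        chosen.append(best)
        H = H + lam * h_responses[best]
    return chosen, H
```

Because *λ*_{ k } = *e*^{(1−k)/K} decays with *k*, the first weak classifier chosen dominates the ensemble, matching the rule that stronger classifiers receive larger weights.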

## 3 Experiments

### 3.1 Experimental setup

In the experiments, we use two public datasets, iCoseg [21] and MSRC [22]. The iCoseg dataset consists of a series of related images for each object; for example, an athlete moving on a horizontal bar. The MSRC dataset monitors a forest environment in which a panda appears in and disappears from the camera view. We test target recognition and tracking in these two scenes.

The baseline algorithms are MIL [14], OAB [23], and SBT [6]. The MIL algorithm is a classical multiple-instance learning approach for target tracking. The OAB algorithm is a boosting approach for target classification in image sequences. The SBT algorithm is a semi-supervised machine learning approach that uses massive untagged data to improve classification accuracy.

### 3.2 Experimental results

To evaluate the performance of the proposed algorithm, we use two metrics: precision and recall. Here, “jumping” denotes the sequence of a woman moving on a horizontal bar, and “panda” denotes the sequence of a panda appearing in the camera.

## 4 Conclusions

In this paper, we studied target recognition and tracking in a sequence of images based on the multiple-instance learning technique. In the target tracking framework, we use image frames to generate positive and negative samples, train a classifier on them, and use the classifier to differentiate the target from its background. A set of weak classifiers is combined into a strong classifier. The experiments show that the proposed approach achieves better precision and recall on two public datasets than related works.

## Declarations

### Acknowledgements

This work was financially supported by the Science and Technology Research Program for the Education Department of Hubei province of China (Q20156002).

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

1. L Chen, H Wei, J Ferryman, A survey of human motion analysis using depth imagery. Pattern Recogn Lett **34**(15), 1995–2006 (2013)
2. OP Popoola, K Wang, Video-based abnormal human behavior recognition—a review. IEEE Trans Syst Man Cybern Part C Appl Rev **42**(6), 865–878 (2012)
3. A Milan, K Schindler, S Roth, Challenges of ground truth evaluation of multi-target tracking, in *IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)*, 2013, pp. 735–742
4. X Jia, H Lu, MH Yang, Visual tracking via adaptive structural local sparse appearance model, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2012, pp. 1822–1829
5. S Zhang, H Yao, X Sun et al., Robust visual tracking using an effective appearance model based on sparse coding. ACM Trans Intell Syst Technol **3**(3), 43 (2012)
6. H Grabner, C Leistner, H Bischof, Semi-supervised on-line boosting for robust tracking, in *Computer Vision–ECCV* (Springer, Berlin Heidelberg, 2008), pp. 234–247
7. Z Kalal, J Matas, K Mikolajczyk, P-N learning: bootstrapping binary classifiers by structural constraints, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2010, pp. 49–56
8. M Denil, L Bazzani, H Larochelle et al., Learning where to attend with deep architectures for image tracking. Neural Comput **24**(8), 2151–2184 (2012)
9. S Zhang, H Yao, X Sun et al., Sparse coding based visual tracking: review and experimental comparison. Pattern Recogn **46**(7), 1772–1788 (2013)
10. T Vojir, J Matas, Robustifying the flock of trackers, in *Computer Vision Winter Workshop*, Graz, Austria, 2011, pp. 91–97
11. ME Maresca, A Petrosino, Clustering local motion estimates for robust and efficient object tracking, in *Computer Vision–ECCV 2014 Workshops* (Springer International Publishing, 2014), pp. 244–253
12. TG Dietterich, RH Lathrop, T Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles. Artif Intell **89**(1), 31–71 (1997)
13. C Zhang, JC Platt, PA Viola, Multiple instance boosting for object detection, in *Advances in Neural Information Processing Systems*, 2005, pp. 1417–1424
14. B Babenko, MH Yang, S Belongie, Robust object tracking with online multiple instance learning. IEEE Trans Pattern Anal Mach Intell **33**(8), 1619–1632 (2011)
15. B Zeisl, C Leistner, A Saffari et al., On-line semi-supervised multiple-instance boosting, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2010, pp. 1879–1879
16. Z Wang, S Yoon, S Xie et al., Visual tracking with semi-supervised online weighted multiple instance learning. Vis Comput (2015), pp. 1–14
17. B Babenko, MH Yang, S Belongie, Visual tracking with online multiple instance learning, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2009, pp. 983–990
18. R Lienhart, J Maydt, An extended set of Haar-like features for rapid object detection, in *International Conference on Image Processing*, 2002, vol. 1, pp. I-900–I-903
19. J Gao, H Ling, W Hu et al., Transfer learning based visual tracking with Gaussian processes regression, in *Computer Vision–ECCV 2014* (Springer International Publishing, 2014), pp. 188–203
20. B Ma, J Shen, Y Liu et al., Visual tracking using strong classifier and structural local sparse descriptors. IEEE Trans Multimedia **17**(10), 1818–1828 (2015)
21. D Batra, A Kowdle, D Parikh et al., iCoseg: interactive co-segmentation with intelligent scribble guidance, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2010, pp. 3169–3176
22. JC Rubio, J Serrat, A López et al., Unsupervised co-segmentation through region matching, in *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2012, pp. 749–756
23. H Grabner, M Grabner, H Bischof, Real-time tracking via on-line boosting, in *British Machine Vision Conference*, 2006, pp. 47–56