TY - JOUR
T1 - Convolutional neural net bagging for online visual tracking
AU - Li, Hanxi
AU - Li, Yi
AU - Porikli, Fatih
N1 - Publisher Copyright:
© 2016
PY - 2016/12/1
Y1 - 2016/12/1
N2 - Recently, Convolutional Neural Nets (CNNs) have been successfully applied to online visual tracking. However, a major problem is that such models are prone to over-fitting due to two main factors. The first is label noise, because online training relies solely on detections from previous frames. The second is model uncertainty arising from the randomized training strategy. In this work, we cope with noisy labels and model uncertainty within the framework of bagging (bootstrap aggregating), resulting in efficient and effective visual tracking. Instead of using multiple models in a bag, we design a single multitask CNN to learn effective feature representations of the target object. In our model, every task has the same structure and shares the same set of convolutional features, but each is trained on different random samples generated for that task. A significant advantage is that the bagging overhead of our model is minimal, and no extra effort is needed to combine the outputs of different tasks, as is required in multi-lifespan models. Experiments demonstrate that our CNN tracker outperforms state-of-the-art methods on three recent benchmarks (over 80 video sequences), illustrating the superiority of the feature representations learned by our purely online bagging framework.
AB - Recently, Convolutional Neural Nets (CNNs) have been successfully applied to online visual tracking. However, a major problem is that such models are prone to over-fitting due to two main factors. The first is label noise, because online training relies solely on detections from previous frames. The second is model uncertainty arising from the randomized training strategy. In this work, we cope with noisy labels and model uncertainty within the framework of bagging (bootstrap aggregating), resulting in efficient and effective visual tracking. Instead of using multiple models in a bag, we design a single multitask CNN to learn effective feature representations of the target object. In our model, every task has the same structure and shares the same set of convolutional features, but each is trained on different random samples generated for that task. A significant advantage is that the bagging overhead of our model is minimal, and no extra effort is needed to combine the outputs of different tasks, as is required in multi-lifespan models. Experiments demonstrate that our CNN tracker outperforms state-of-the-art methods on three recent benchmarks (over 80 video sequences), illustrating the superiority of the feature representations learned by our purely online bagging framework.
KW - Deep learning
KW - Ensemble learning
KW - Visual tracking
UR - http://www.scopus.com/inward/record.url?scp=84994130587&partnerID=8YFLogxK
U2 - 10.1016/j.cviu.2016.07.002
DO - 10.1016/j.cviu.2016.07.002
M3 - Article
SN - 1077-3142
VL - 153
SP - 120
EP - 129
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
ER -