TY - JOUR
T1 - Not All Negatives Are Equal
T2 - Learning to Track with Multiple Background Clusters
AU - Zhu, Gao
AU - Porikli, Fatih
AU - Li, Hongdong
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2018/2
Y1 - 2018/2
N2 - Conventional tracking-by-detection approaches for visual object tracking often assume that the task at hand is a binary foreground-versus-background classification problem, in which the background is a single, generic, and all-inclusive class. In contrast, here we argue that the background appearance, for the most part, possesses a more complicated structure that would benefit from further partitioning into multiple contextual clusters. Our observation is that, although the background class is assumed to contain vast intra-class variation, only a small portion of this diversity is present around the foreground object in the current frame during tracking. This observation motivates us to build multiple fine-grained foreground-versus-contextual-cluster models that provide more discriminative classifications, and consequently more robust and accurate foreground object tracking. For each cluster, we employ a structured output support vector machine (SSVM), and in an online manner, we combine the responses of multiple classifiers. To this end, we apply a top-level SSVM that models the tracked foreground object. We show that our refined modeling of the background is better than naïvely growing the complexity of a single foreground-background classifier, i.e., increasing the number of support vectors that existing approaches rely on, which causes overfitting issues. Our extensive evaluations on large benchmark data sets demonstrate that our tracker consistently outperforms the current state of the art while having comparable computational requirements.
AB - Conventional tracking-by-detection approaches for visual object tracking often assume that the task at hand is a binary foreground-versus-background classification problem, in which the background is a single, generic, and all-inclusive class. In contrast, here we argue that the background appearance, for the most part, possesses a more complicated structure that would benefit from further partitioning into multiple contextual clusters. Our observation is that, although the background class is assumed to contain vast intra-class variation, only a small portion of this diversity is present around the foreground object in the current frame during tracking. This observation motivates us to build multiple fine-grained foreground-versus-contextual-cluster models that provide more discriminative classifications, and consequently more robust and accurate foreground object tracking. For each cluster, we employ a structured output support vector machine (SSVM), and in an online manner, we combine the responses of multiple classifiers. To this end, we apply a top-level SSVM that models the tracked foreground object. We show that our refined modeling of the background is better than naïvely growing the complexity of a single foreground-background classifier, i.e., increasing the number of support vectors that existing approaches rely on, which causes overfitting issues. Our extensive evaluations on large benchmark data sets demonstrate that our tracker consistently outperforms the current state of the art while having comparable computational requirements.
KW - Contextual clustering
KW - fine-grained model
KW - support vector machine (SVM)
KW - tracking by detection
UR - http://www.scopus.com/inward/record.url?scp=85041951080&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2016.2615518
DO - 10.1109/TCSVT.2016.2615518
M3 - Article
SN - 1051-8215
VL - 28
SP - 314
EP - 326
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 2
M1 - 7583707
ER -