TY - GEN
T1 - Higher-order pooling of CNN features via kernel linearization for action recognition
AU - Cherian, Anoop
AU - Koniusz, Piotr
AU - Gould, Stephen
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/5/11
Y1 - 2017/5/11
N2 - Most successful deep learning algorithms for action recognition extend models designed for image-based tasks such as object recognition to video. Such extensions are typically trained for actions on single video frames or very short clips, and then their predictions from sliding windows over the video sequence are pooled for recognizing the action at the sequence level. Usually this pooling step uses the first-order statistics of frame-level action predictions. In this paper, we explore the advantages of using higher-order correlations; specifically, we introduce Higher-order Kernel (HOK) descriptors generated from the late fusion of CNN classifier scores from all the frames in a sequence. To generate these descriptors, we use the idea of kernel linearization. Specifically, a similarity kernel matrix, which captures the temporal evolution of deep classifier scores, is first linearized into kernel feature maps. The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are then used as input to a video-level classifier. We provide experiments on two fine-grained action recognition datasets, and show that our scheme leads to state-of-the-art results.
AB - Most successful deep learning algorithms for action recognition extend models designed for image-based tasks such as object recognition to video. Such extensions are typically trained for actions on single video frames or very short clips, and then their predictions from sliding windows over the video sequence are pooled for recognizing the action at the sequence level. Usually this pooling step uses the first-order statistics of frame-level action predictions. In this paper, we explore the advantages of using higher-order correlations; specifically, we introduce Higher-order Kernel (HOK) descriptors generated from the late fusion of CNN classifier scores from all the frames in a sequence. To generate these descriptors, we use the idea of kernel linearization. Specifically, a similarity kernel matrix, which captures the temporal evolution of deep classifier scores, is first linearized into kernel feature maps. The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are then used as input to a video-level classifier. We provide experiments on two fine-grained action recognition datasets, and show that our scheme leads to state-of-the-art results.
UR - http://www.scopus.com/inward/record.url?scp=85020202908&partnerID=8YFLogxK
U2 - 10.1109/WACV.2017.22
DO - 10.1109/WACV.2017.22
M3 - Conference contribution
T3 - Proceedings - 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017
SP - 130
EP - 138
BT - Proceedings - 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th IEEE Winter Conference on Applications of Computer Vision, WACV 2017
Y2 - 24 March 2017 through 31 March 2017
ER -