Motion meets Attention: Video Motion Prompts

Qixiang Chen, Lei Wang*, Piotr Koniusz, Tom Gedeon

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

Abstract

Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual contents rather than precise motion features. This phenomenon is referred to as ‘blind motion extraction’ behavior, which proves inefficient in capturing motions of interest due to a lack of motion-guided cues. Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to modulate motion signals from frame differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporal continuity and smoothness of the attention maps, we apply pair-wise temporal attention variation regularization to remove unwanted motions (e.g., noise) while preserving important ones. We then perform Hadamard product between each pair of attention maps and the original video frames to highlight the evolving motions of interest over time. These highlighted motions, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts. This layer serves as an adapter between the model and the video data, bridging the gap between traditional ‘blind motion extraction’ and the extraction of relevant motions of interest. We show that our lightweight, plug-and-play motion prompt layer seamlessly integrates into models like SlowFast, X3D, and TimeSformer, enhancing performance on benchmarks such as FineGym and MPII Cooking 2.
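The pipeline described in the abstract — frame differencing, a sigmoid-shaped attention map with slope and shift parameters, a temporal variation regularizer, and a Hadamard product with the original frames — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: in the paper the slope and shift are learned end-to-end inside a motion prompt layer, whereas here they are fixed scalars, and the function name and parameter names are illustrative.

```python
import numpy as np

def motion_prompts(frames, slope=5.0, shift=0.0, reg_weight=1.0):
    """Sketch of the video motion prompt idea (illustrative only).

    frames: array of shape (T, H, W) with values in [0, 1].
    Returns (prompts, reg): T-1 highlighted frames and a scalar
    pair-wise temporal attention variation penalty.
    """
    # Frame differencing maps: |F_{t+1} - F_t|
    diffs = np.abs(frames[1:] - frames[:-1])
    # Modified sigmoid attention; slope/shift are learnable in the paper
    attn = 1.0 / (1.0 + np.exp(-slope * (diffs - shift)))
    # Pair-wise temporal attention variation regularization
    # (penalizes abrupt changes between consecutive attention maps)
    reg = reg_weight * np.mean((attn[1:] - attn[:-1]) ** 2)
    # Hadamard product highlights evolving motions of interest
    prompts = attn * frames[:-1]
    return prompts, reg

# Toy video: 4 frames of 8x8 grayscale noise
video = np.random.rand(4, 8, 8).astype(np.float32)
prompts, reg = motion_prompts(video)
print(prompts.shape)  # (3, 8, 8)
```

In the full method, `prompts` would replace the raw frames as input to a backbone such as SlowFast or X3D, and `reg` would be added to the training loss so the attention maps stay temporally smooth.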

Original language: English
Pages (from-to): 591-606
Number of pages: 16
Journal: Proceedings of Machine Learning Research
Volume: 260
Publication status: Published - 2024
Event: 16th Asian Conference on Machine Learning, ACML 2024 - Hanoi, Viet Nam
Duration: 5 Dec 2024 – 8 Dec 2024
