TY - CPAPER
T1 - Proposal-free temporal moment localization of a natural-language query in video using guided attention
AU - Rodriguez-Opazo, Cristian
AU - Marrese-Taylor, Edison
AU - Saleh, Fatemeh Sadat
AU - Li, Hongdong
AU - Gould, Stephen
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/3
Y1 - 2020/3
AB - This paper studies the problem of temporal moment localization in a long untrimmed video using natural language as the query. Given an untrimmed video and a query sentence, the goal is to determine the start and end of the visual moment in the video that corresponds to the query. While most previous work has tackled this with a propose-and-rank approach, we introduce a more efficient, end-to-end trainable, proposal-free approach built upon three key components: a dynamic filter that adaptively transfers language information to a visual-domain attention map, a new loss function that guides the model to attend to the most relevant parts of the video, and soft labels that cope with annotation uncertainty. We evaluate our method on three standard benchmark datasets: Charades-STA, TACoS, and ActivityNet-Captions. Experimental results show that our method outperforms state-of-the-art methods on these datasets, confirming its effectiveness. We believe the proposed dynamic-filter-based guided attention mechanism will prove valuable for other vision-and-language tasks as well.
UR - http://www.scopus.com/inward/record.url?scp=85085522363&partnerID=8YFLogxK
DO - 10.1109/WACV45572.2020.9093328
M3 - Conference contribution
T3 - Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020
SP - 2453
EP - 2462
BT - Proceedings - 2020 IEEE Winter Conference on Applications of Computer Vision, WACV 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2020
Y2 - 1 March 2020 through 5 March 2020
ER -