TY - GEN
T1 - Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
AU - Xu, Ming
AU - Gould, Stephen
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
AB - We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
KW - long-form video understanding; optimal transport; unsupervised learning; action segmentation; procedural videos
UR - http://www.scopus.com/inward/record.url?scp=85199488416&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.01385
DO - 10.1109/CVPR52733.2024.01385
M3 - Conference contribution
AN - SCOPUS:85199488416
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 14618
EP - 14627
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Y2 - 16 June 2024 through 22 June 2024
ER -