Abstract
We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we can decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, the resulting (fused) Gromov-Wasserstein problem can be solved efficiently on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where it is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions, and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task.
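
To make the decoding idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a simplified objective `alpha * sum(Cx[i,j] * Ca[k,l] * T[i,k] * T[j,l]) + (1 - alpha) * sum(C[i,k] * T[i,k])`, where `Cx` marks frame pairs within a temporal radius and `Ca` marks pairs of distinct classes, so nearby frames assigned to different classes are penalized. Both structure matrices, all function names (`temporal_structure`, `segment_fgw`, `count_transitions`, `make_toy_cost`), and all parameter values are assumptions chosen for illustration. As a further simplification, only the per-frame marginal is enforced, so the KL projection after each mirror step reduces to row normalization.

```python
import numpy as np


def temporal_structure(n_frames, radius):
    """Return Cx with Cx[i, j] = 1 if 1 <= |i - j| <= radius, else 0 (assumed prior)."""
    idx = np.arange(n_frames)
    dist = np.abs(idx[:, None] - idx[None, :])
    return ((dist >= 1) & (dist <= radius)).astype(float)


def segment_fgw(cost, alpha=0.5, radius=3, step=1.0, n_iters=20):
    """Decode frame labels from a (frames x classes) cost matrix.

    Runs entropic mirror descent on the simplified fused objective described
    in the lead-in. Each row of the soft assignment is kept on the probability
    simplex, so the KL projection is an exact row normalization.
    """
    n_frames, n_classes = cost.shape
    cx = temporal_structure(n_frames, radius)
    ca = 1.0 - np.eye(n_classes)  # 1 for distinct classes, 0 on the diagonal
    # Work in the log domain, starting from a uniform soft assignment.
    log_t = np.full((n_frames, n_classes), -np.log(n_classes))
    for _ in range(n_iters):
        soft = np.exp(log_t)
        # Gradient of the objective; Cx and Ca are symmetric, hence the factor 2.
        grad = 2.0 * alpha * (cx @ soft @ ca) + (1.0 - alpha) * cost
        # Mirror (multiplicative) step, then row-wise renormalization.
        log_t = log_t - step * grad
        log_t -= log_t.max(axis=1, keepdims=True)
        log_t -= np.log(np.exp(log_t).sum(axis=1, keepdims=True))
    soft = np.exp(log_t)
    return soft.argmax(axis=1), soft


def count_transitions(labels):
    """Number of positions where consecutive frame labels differ."""
    return int(np.count_nonzero(np.diff(labels)))


def make_toy_cost():
    """Hypothetical toy input: 3 actions x 10 frames each, with 3 misleading frames."""
    truth = np.repeat(np.arange(3), 10)
    cost = 1.0 - np.eye(3)[truth]
    cost[4] = [0.6, 0.4, 1.0]    # frame 4 looks slightly more like class 1
    cost[15] = [1.0, 0.6, 0.4]   # frame 15 looks slightly more like class 2
    cost[24] = [0.4, 1.0, 0.6]   # frame 24 looks slightly more like class 0
    return cost, truth
```

On the toy input, a per-frame `cost.argmin(axis=1)` follows the three misleading frames, whereas `segment_fgw(cost)` is pulled toward the labels of neighboring frames. No action order is supplied in either case. A full fused Gromov-Wasserstein solver would also handle the class marginal, which this sketch deliberately omits.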
| Original language | English |
|---|---|
| Pages | 1-12 |
| Number of pages | 12 |
| Publication status | Accepted/In press - 4 May 2022 |
| Event | IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024, Seattle Convention Center, Seattle, United States. Duration: 17 Jun 2024 → 21 Jun 2024. https://cvpr.thecvf.com |
Conference
| Conference | IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 |
|---|---|
| Abbreviated title | CVPR |
| Country/Territory | United States |
| City | Seattle |
| Period | 17/06/24 → 21/06/24 |
| Internet address | https://cvpr.thecvf.com |