TY - GEN
T1 - SCENES: Subpixel Correspondence Estimation With Epipolar Supervision
AU - Kloepfer, Dominik A.
AU - Henriques, Joao F.
AU - Campbell, Dylan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/6/12
AB - Extracting point correspondences from two or more views of a scene is a fundamental computer vision problem with particular importance for relative camera pose estimation and structure-from-motion. Existing local feature matching approaches, trained with correspondence supervision on large-scale datasets, obtain highly accurate matches on the test sets. However, they do not generalise well to new datasets with different characteristics from those they were trained on, unlike classic feature extractors. Instead, they require finetuning, which assumes that ground-truth correspondences or ground-truth camera poses and 3D structure are available. We relax this assumption by removing the requirement of 3D structure, e.g., depth maps or point clouds, and only require camera pose information, which can be obtained from odometry. We do so by replacing correspondence losses with epipolar losses, which encourage putative matches to lie on the associated epipolar line. While this cue is weaker than correspondence supervision, we observe that it is sufficient for finetuning existing models on new data. We then further relax the assumption of known camera poses by using pose estimates in a novel bootstrapping approach. We evaluate on highly challenging datasets, including an indoor drone dataset and an outdoor smartphone camera dataset, and obtain state-of-the-art results without strong supervision.
KW - correspondence estimation
KW - domain adaptation
KW - domain shift
KW - feature matching
KW - pixel correspondences
KW - relative camera pose
KW - weak supervision
UR - http://www.scopus.com/inward/record.url?scp=85196741384&partnerID=8YFLogxK
DO - 10.1109/3DV62453.2024.00137
M3 - Conference contribution
AN - SCOPUS:85196741384
SN - 979-8-3503-6246-6
T3 - Proceedings - 2024 International Conference on 3D Vision, 3DV 2024
SP - 21
EP - 30
BT - Proceedings - 2024 International Conference on 3D Vision, 3DV 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 11th International Conference on 3D Vision, 3DV 2024
Y2 - 18 March 2024 through 21 March 2024
ER -