TY - JOUR
T1 - ConsistNet
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Yang, Jiayu
AU - Cheng, Ziang
AU - Duan, Yunfei
AU - Ji, Pan
AU - Li, Hongdong
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Given a single image of a 3D object, this paper proposes a novel method (named ConsistNet) that can generate multiple images of the same object, as if they are captured from different viewpoints, while the 3D (multi-view) consistencies among those multiple generated images are effectively exploited. Central to our method is a lightweight multi-view consistency block that enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model and it consists of two submodules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infers consistency, and (b) a ray aggregation module that samples and aggregates 3D consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation, in that it can be easily dropped into pretrained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123-XL backbone and can generate 16 surrounding views of the object within 11 seconds on a single A100 GPU. Our code will be made available at https://github.com/JiayuYANG/ConsistNet.
AB - Given a single image of a 3D object, this paper proposes a novel method (named ConsistNet) that can generate multiple images of the same object, as if they are captured from different viewpoints, while the 3D (multi-view) consistencies among those multiple generated images are effectively exploited. Central to our method is a lightweight multi-view consistency block that enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model and it consists of two submodules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infers consistency, and (b) a ray aggregation module that samples and aggregates 3D consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation, in that it can be easily dropped into pretrained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123-XL backbone and can generate 16 surrounding views of the object within 11 seconds on a single A100 GPU. Our code will be made available at https://github.com/JiayuYANG/ConsistNet.
KW - image generation
KW - latent diffusion model
UR - http://www.scopus.com/inward/record.url?scp=85201963317&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.00676
DO - 10.1109/CVPR52733.2024.00676
M3 - Conference article
AN - SCOPUS:85201963317
SN - 1063-6919
SP - 7079
EP - 7088
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Y2 - 16 June 2024 through 22 June 2024
ER -