TY - JOUR
T1 - Deep Hierarchical Representation of Point Cloud Videos via Spatio-Temporal Decomposition
AU - Fan, Hehe
AU - Yu, Xin
AU - Yang, Yi
AU - Kankanhalli, Mohan
N1 - Publisher Copyright: © 1979-2012 IEEE.
PY - 2022/12/1
Y1 - 2022/12/1
AB - In point cloud videos, point coordinates are irregular and unordered but point timestamps exhibit regularities and order. Grid-based networks for conventional video processing cannot be directly used to model raw point cloud videos. Therefore, in this work, we propose a point-based network that directly handles raw point cloud videos. First, to preserve the spatio-temporal local structure of point cloud videos, we design a point tube covering a local range along spatial and temporal dimensions. By progressively subsampling frames and points and enlarging the spatial radius as the point features are fed into higher-level layers, the point tube can capture video structure in a spatio-temporally hierarchical manner. Second, to reduce the impact of the spatial irregularity on temporal modeling, we decompose space and time when extracting point tube representations. Specifically, a spatial operation is employed to encode the local structure of each spatial region in a tube and a temporal operation is used to encode the dynamics of the spatial regions along the tube. Empirically, the proposed network shows strong performance on 3D action recognition, 4D semantic segmentation and scene flow estimation. Theoretically, we analyse the necessity to decompose space and time in point cloud video modeling and why the network outperforms existing methods.
KW - Point cloud
KW - action recognition
KW - scene flow estimation
KW - semantic segmentation
KW - spatio-temporal modeling
KW - video analysis
UR - http://www.scopus.com/inward/record.url?scp=85121786794&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2021.3135117
DO - 10.1109/TPAMI.2021.3135117
M3 - Article
SN - 0162-8828
VL - 44
SP - 9918
EP - 9930
JO - IEEE Transactions on Pattern Analysis and Machine Intelligence
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
IS - 12
ER -