PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception

arXiv:2510.17568v4

Recent feed-forward 3D models, such as the Visual Geometry Grounded Transformer (VGGT), have shown a strong ability to infer 3D attributes of static scenes. However, because they are typically trained on static datasets, these models often struggle in real-world scenarios that involve complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we