4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation

arXiv:2605.12027v1 Announce Type: new Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance degrades significantly in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel