Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control

arXiv:2509.15130v3 Announce Type: replace-cross

Video diffusion models have rich world priors, but their application to spatial tasks is hindered by limited controllability, spatiotemporally inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping, often