AI RESEARCH
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
arXiv CS.CV
arXiv:2604.03181v1 Announce Type: cross Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we