MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer

arXiv:2508.14327v2 Announce Type: replace

Urban scene synthesis with video generation models has recently shown great potential for autonomous driving. Existing video generation approaches for autonomous driving focus primarily on RGB video and cannot generate multi-modal video. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving.