JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

arXiv:2601.22143v2 Announce Type: replace-cross Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we