VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning

arXiv:2509.24773v3 Announce Type: replace-cross Video-conditioned audio generation spans Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), which have traditionally been treated as distinct tasks, leaving the potential for a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that seamlessly solves both problems.