ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

arXiv:2604.15086v1 Announce Type: cross Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and from imprecise stylistic control, because temporal and timbre information are entangled in the reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation.