FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

arXiv:2603.19857v1 Announce Type: cross Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, as with small regions, off-screen sounds, or occluded or partially visible objects.