MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video

arXiv:2605.00495v1 Announce Type: cross Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels that specify the type and timing of each sound. One straightforward approach is to apply a standard sound event detection (SED) model to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation across the generation and detection stages.