AI RESEARCH
Exploring Audio Hallucination in Egocentric Video Understanding
arXiv CS.CV
arXiv:2604.23860v1
Egocentric videos provide a distinctive setting in which sound serves as a crucial cue for understanding user activities and surroundings, particularly when visual information is unstable or occluded due to continuous camera movement. State-of-the-art large audio-visual language models (AV-LLMs) can generate multimodal descriptions. However, we show in this work that they are prone to audio hallucinations, often inferring sounds from visual cues that are visible but not heard.