VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

arXiv:2510.08618v2 Announce Type: replace-cross Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we identify a fundamental issue faced by OLLMs, Visual Interference: models are biased towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken.