Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

arXiv:2605.12034v1 Announce Type: cross Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audio-visual-language evidence integration, and how post-