Stage-adaptive Token Selection for Efficient Omni-modal LLMs

arXiv:2605.20035v1 Announce Type: new

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although