OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

arXiv:2602.04804v2 Announce Type: replace Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs.