OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

arXiv:2605.14458v1 Announce Type: new Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a.