OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

arXiv:2511.14582v2 Announce Type: replace

Omnimodal large language models (OmniLLMs) have recently attracted growing research attention for unified audio-video understanding. However, the high computational cost of processing long joint audio-video token sequences has become a key bottleneck. Existing token compression methods do not yet address the emerging need to compress tokens from multiple modalities jointly. To bridge this gap, we present OmniZip, a