AI RESEARCH

HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

arXiv cs.AI

arXiv:2604.05887v1 Announce Type: new

Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, so the cache grows linearly with context length and stays resident in GPU memory throughout decoding. The result is prohibitive memory overhead and latency, even on high-end GPUs.
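
To make the linear scaling concrete, here is a rough back-of-the-envelope sketch of KV cache size for a standard decoder-only transformer. It is not taken from the paper: the function name kv_cache_bytes and the model dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, and real MLLMs may use grouped-query attention or other layouts that change the constants.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    Each layer stores one key and one value vector per token per KV head,
    hence the leading factor of 2. All arguments are illustrative
    assumptions, not values reported by the paper.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * num_tokens * bytes_per_value)


# Hypothetical model configuration (assumed for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
print(per_token)  # 524288 bytes, i.e. 512 KiB per token

# The cache grows linearly with the number of tokens in the context.
for tokens in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB")
```

Under these assumed dimensions, a context of 10,000 tokens already occupies about 4.88 GiB, which suggests why inputs that expand into thousands of visual tokens put pressure on GPU memory.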