AI RESEARCH

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

arXiv CS.CV

arXiv:2604.04722v1 Announce Type: new

Large Language Models (LLMs) have made remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost.
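To make the linear growth of the KV cache concrete, the following is a minimal back-of-the-envelope sketch in Python. The function name `kv_cache_bytes`, the model dimensions, and the uniform bit widths are all assumptions chosen for illustration; they are not taken from the paper, and the uniform quantization shown here is not the adaptive scheme the title refers to. The estimate counts only the stored key and value tensors and ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value=16, batch_size=1):
    """Approximate KV-cache size in bytes.

    Counts one key and one value vector per token, per layer, per KV head.
    Quantization metadata (scales, zero points) is deliberately ignored.
    """
    # The factor 2 accounts for storing both keys and values.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value // 8


# Hypothetical small-model configuration, for illustration only.
cfg = dict(num_layers=24, num_kv_heads=8, head_dim=64)

for seq_len in (1024, 4096, 16384):
    fp16 = kv_cache_bytes(seq_len=seq_len, **cfg)
    int4 = kv_cache_bytes(seq_len=seq_len, bits_per_value=4, **cfg)
    print(f"{seq_len:>6} tokens: 16-bit = {fp16 / 2**20:.1f} MiB, "
          f"4-bit = {int4 / 2**20:.1f} MiB")
```

Under these assumed dimensions, doubling the context length doubles the cache size, while lowering the bit width per stored value shrinks it proportionally, which is why the number of bits spent on each cached entry matters on memory-constrained devices.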