TokenButler: Token Importance is Predictable

arXiv:2503.07518v2 Announce Type: replace-cross Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck: prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and depend heavily on the input query.
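
The sparsity and query-dependence described above can be illustrated with a toy single-head attention step. This is a minimal sketch under stated assumptions, not TokenButler's method: the cached keys, the two queries, the 90% mass threshold, and the helper names `attention_weights` and `critical_tokens` are all hypothetical choices made for illustration.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def critical_tokens(weights, mass=0.9):
    """Indices of the fewest tokens whose weights sum to at least `mass`."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += weights[i]
        if covered >= mass:
            break
    return sorted(chosen)


# Hypothetical KV-cache keys for six past tokens (2-dimensional for readability).
keys = [
    [4.0, 0.0],
    [0.0, 4.0],
    [0.1, 0.1],
    [-1.0, -1.0],
    [3.5, 0.2],
    [0.2, 3.5],
]

# Two hypothetical decoding-step queries pointing in different directions.
query_a = [3.0, 0.0]
query_b = [0.0, 3.0]

critical_a = critical_tokens(attention_weights(query_a, keys))
critical_b = critical_tokens(attention_weights(query_b, keys))
print(critical_a, critical_b)
```

In this toy setup, two of the six cached tokens carry at least 90% of the attention mass for each query, and the two queries select disjoint subsets. A fixed, query-independent choice of which tokens to keep would therefore miss the critical tokens for at least one of them.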