FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

arXiv:2511.00868v2

Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially during long generation.
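
To make the scaling claim concrete, the following is a minimal back-of-the-envelope sketch of how KV cache memory grows with the number of cached tokens (prompt plus generated). It uses the standard accounting for a decoder-only transformer, not anything specific to FlexiCache. The function name `kv_cache_bytes` and the model dimensions in the example are illustrative assumptions and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the KV cache size in bytes for a decoder-only transformer.

    Assumes every layer stores one key and one value vector of size
    head_dim per KV head per token, with no eviction or compression.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    # Leading factor 2: one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration (an assumption, not from the paper):
# 32 layers, 8 KV heads, head dimension 128, fp16 storage.
prompt_len = 32_000   # context tokens
generated = 8_000     # tokens produced during decoding

per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
total = kv_cache_bytes(32, 8, 128, seq_len=prompt_len + generated)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"total for {prompt_len + generated} tokens: {total / 2**30:.2f} GiB")
```

Because the estimate is linear in `seq_len`, every generated token adds the same per-token cost as a prompt token, which is why long generation puts as much pressure on memory as long context does.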