AI RESEARCH
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
arXiv CS.LG
arXiv:2605.05696v1 Announce Type: cross Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe time-to-first-token (TTFT) spikes of 10-16s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself.
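To make the "correct RoPE on the full $d_K$-dimensional key" step concrete, the following is a minimal sketch of the generic idea: because RoPE rotations compose, a key cached at one position can be re-rotated by the position shift rather than recomputed. It is not taken from the paper and does not show Irminsul's own mechanism. The function name `rope_rotate`, the interleaved-pair layout, the base of 10000, the toy dimension of 8, and the example positions are all illustrative assumptions.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply RoPE to an even-length vector at integer position `pos`.

    Assumes the interleaved convention: elements (i, i+1) form one 2-D pair
    rotated by angle pos * base**(-i/d). Other layouts exist.
    """
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

# Hypothetical un-rotated key with d_K = 8 (real models use far larger d_K).
raw_key = [0.3, -1.2, 0.7, 0.05, -0.4, 0.9, 1.1, -0.6]
old_pos, new_pos = 17, 42  # the same token reappears at a shifted position

cached_key = rope_rotate(raw_key, old_pos)               # what the cache holds
corrected = rope_rotate(cached_key, new_pos - old_pos)   # re-rotate by the shift
recomputed = rope_rotate(raw_key, new_pos)               # reference: full recompute

# Largest element-wise gap between the corrected and recomputed keys.
max_err = max(abs(a - b) for a, b in zip(corrected, recomputed))
```

In this sketch the correction touches every one of the $d_K$ key dimensions for each cached token, which is the per-key cost the abstract attributes to prior position-independent caching systems.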