
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

arXiv cs.LG

arXiv:2605.05696v1 Announce Type: cross

Agentic LLM workloads put bit-identical tokens at shifted positions every turn, voiding prefix caches at the first byte of divergence. Operators report cache-hit regressions ranging from moderate slowdowns to severe time-to-first-token (TTFT) spikes of 10-16 s on unchanged content. Prior position-independent caching systems correct RoPE on the full $d_K$-dimensional key, an architectural cost imposed by GQA, not by caching itself.
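To make the RoPE correction concrete, the following is a minimal sketch of re-rotating a cached key when its token moves by `delta` positions. It is not the paper's method. It assumes the common interleaved-pair RoPE layout and a base of 10000, and the names `rope` and `shift_cached_key` are hypothetical. It relies only on the fact that rotations compose: rotating by `p` and then by `delta` equals rotating by `p + delta`.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of `vec` by an angle proportional to `pos`.

    Assumption: interleaved pairing with frequency base ** (-i / d) for pair
    start index i; real models may use a different layout or base.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def shift_cached_key(cached_key, delta, base=10000.0):
    """Re-rotate a key cached at position p so it matches position p + delta.

    Hypothetical helper: applies one extra rotation by `delta` across all
    dimensions of the key.
    """
    return rope(cached_key, delta, base)

# A toy 8-dimensional key, cached at position 7, then reused at position 12.
key = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.1, -0.3]
cached = rope(key, 7)
corrected = shift_cached_key(cached, 5)
recomputed = rope(key, 12)

# The corrected key should agree with a fresh computation up to rounding.
max_err = max(abs(a - b) for a, b in zip(corrected, recomputed))
print(max_err < 1e-9)
```

In this sketch the correction touches every dimension of every cached key, so its cost grows with both $d_K$ and the number of cached tokens being moved.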