A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

arXiv:2605.04595v1 Announce Type: cross The rapid adoption of large language models (LLMs) has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-value (KV) caching, which accelerates decoding but quickly exhausts GPU memory. In this paper, we