AI RESEARCH

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

arXiv CS.AI

arXiv:2601.20309v2 Announce Type: replace-cross

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. Prior work has explored PCIe-based offloading, but these approaches cannot sustain responsiveness under high request rates and often fail to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs.
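
To make the two latency targets concrete, here is a minimal sketch of how TTFT and TBT could be measured for one request and checked against SLOs. It is not taken from SuperInfer: the names `RequestTrace`, `ttft`, `max_tbt`, and `meets_slo`, the thresholds, and the choice to check the worst-case gap between tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (all times in seconds)."""
    arrival: float            # when the request reached the server
    token_times: List[float]  # emission time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time-To-First-Token: delay from arrival to the first output token.

    Assumes the trace holds at least one token.
    """
    return trace.token_times[0] - trace.arrival


def max_tbt(trace: RequestTrace) -> float:
    """Largest gap between consecutive output tokens (0.0 for a single token)."""
    gaps = [later - earlier
            for earlier, later in zip(trace.token_times, trace.token_times[1:])]
    return max(gaps, default=0.0)


def meets_slo(trace: RequestTrace, ttft_slo: float, tbt_slo: float) -> bool:
    """True only if both the TTFT target and the TBT target are met."""
    return ttft(trace) <= ttft_slo and max_tbt(trace) <= tbt_slo


# A request served promptly, and one that waited in the queue behind others
# (the head-of-line blocking case): decoding speed is the same, but the
# delayed first token alone violates the TTFT target.
prompt_request = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0])
blocked_request = RequestTrace(arrival=0.0, token_times=[4.0, 4.25, 4.5])

print(meets_slo(prompt_request, ttft_slo=1.0, tbt_slo=0.25))
print(meets_slo(blocked_request, ttft_slo=1.0, tbt_slo=0.25))
```

Real serving systems often evaluate TBT at a percentile rather than at the maximum; the worst-case gap is used here only to keep the sketch short.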