AI RESEARCH

STAR: Decode-Phase Rescheduling for LLM Inference

arXiv CS.LG

arXiv:2510.13668v2 Announce Type: replace-cross

Large Language Model (LLM) inference has emerged as a fundamental paradigm. However, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as prefill-decode (PD) disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in service-level objective (SLO) violations and out-of-memory (OOM) failures as decode workloads evolve. In this paper, we propose STAR, a decode rescheduling system that uses length prediction to anticipate future workloads.
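To make the idea of prediction-driven decode rescheduling concrete, here is a minimal sketch in Python. It is not STAR's algorithm, which the abstract does not describe in detail. Every name in it (`Request`, `DecodeInstance`, `pick_migration`, `apply_migration`) and the load metric (tokens held now plus predicted remaining tokens) are assumptions made for illustration. The sketch picks one request to move from the most loaded decode instance to the least loaded one when doing so narrows the projected load gap.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    tokens_so_far: int        # prompt + generated tokens currently in the KV cache
    predicted_remaining: int  # output of a hypothetical length predictor

    @property
    def projected_tokens(self) -> int:
        # Assumed load metric: tokens this request will hold when it finishes.
        return self.tokens_so_far + self.predicted_remaining


@dataclass
class DecodeInstance:
    name: str
    requests: list = field(default_factory=list)

    def projected_load(self) -> int:
        return sum(r.projected_tokens for r in self.requests)


def pick_migration(instances):
    """Return (request, source, target) for the single move that most narrows
    the projected load gap between the most and least loaded instances,
    or None if no move improves the balance."""
    src = max(instances, key=lambda i: i.projected_load())
    dst = min(instances, key=lambda i: i.projected_load())
    gap = src.projected_load() - dst.projected_load()

    best, best_gap = None, gap
    for r in src.requests:
        # Moving r lowers the source load and raises the target load by the
        # same amount, so the gap changes by twice r's projected size.
        new_gap = abs(gap - 2 * r.projected_tokens)
        if new_gap < best_gap:
            best, best_gap = r, new_gap

    return None if best is None else (best, src, dst)


def apply_migration(move):
    """Move the chosen request from its source instance to the target."""
    req, src, dst = move
    src.requests.remove(req)
    dst.requests.append(req)


if __name__ == "__main__":
    a = DecodeInstance("a", [Request("r1", 100, 900),
                             Request("r2", 200, 300),
                             Request("r3", 50, 50)])
    b = DecodeInstance("b", [Request("r4", 100, 100)])
    move = pick_migration([a, b])
    if move is not None:
        apply_migration(move)
    print(a.projected_load(), b.projected_load())
```

A static prefill-to-decode assignment would leave the two instances at their initial projected loads for the lifetime of the requests. The sketch instead re-evaluates placement using predicted lengths. A real system would also need to account for the cost of transferring KV cache state and for prediction error, both of which this example ignores.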