Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

arXiv:2504.07347v3

As demand for Large Language Models (LLMs) and AI agents grows rapidly, optimizing systems for efficient LLM inference becomes critical. While significant effort has gone into system-level engineering, little has been explored from a mathematical modeling and queueing perspective. In this paper, we develop the queueing fundamentals for LLM inference. In particular, we study the throughput aspect of LLM inference systems.