Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

arXiv:2504.11320v3 Announce Type: replace-cross Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and cost. The difficulty is endogenous memory growth: generated tokens expand the Key-Value (KV) cache, and overflow can evict in-progress requests and waste prior computation.
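
To make the memory-growth problem concrete, the following is a minimal toy sketch, not the paper's algorithm. It assumes KV cache capacity is measured in tokens, that every active request generates one token per decode step, that the most recently admitted request is evicted first on overflow, and that evicted requests are not re-admitted. The names `Request` and `run_batch` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int     # prompt tokens, held in the KV cache from the start
    output_len: int     # number of tokens the request will generate
    generated: int = 0  # tokens generated so far

    @property
    def kv_tokens(self) -> int:
        # KV cache footprint grows by one token per decode step.
        return self.prompt_len + self.generated


def run_batch(requests, capacity):
    """Decode a batch step by step under a KV cache budget (in tokens).

    Assumed toy policy: before each step, if the next token of every
    active request would not fit, evict the newest request. Returns
    (number of evictions, generated tokens lost to eviction).
    """
    active = list(requests)
    evicted = 0
    wasted_tokens = 0
    while active:
        # The next step needs room for one more token per active request.
        while active and sum(r.kv_tokens + 1 for r in active) > capacity:
            victim = active.pop()              # newest request goes first
            wasted_tokens += victim.generated  # its progress is discarded
            evicted += 1
        for r in active:
            r.generated += 1
        # Completed requests release their KV cache.
        active = [r for r in active if r.generated < r.output_len]
    return evicted, wasted_tokens


if __name__ == "__main__":
    batch = [Request(prompt_len=4, output_len=4) for _ in range(2)]
    print(run_batch(batch, capacity=12))
```

In this sketch both requests fit at admission time (8 prompt tokens against a budget of 12), yet the batch overflows partway through decoding because each generated token enlarges the cache. That is the sense in which memory growth is endogenous to the schedule.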