AI RESEARCH
Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
arXiv cs.CL
arXiv:2604.08075v1 Announce Type: new. Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet they are served under configurations optimized for long contexts, wasting 4-8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch.
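The arithmetic behind this mismatch can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which an instance has a fixed KV-cache budget measured in tokens and admits only as many concurrent requests as could each grow to the configured maximum context length. This is not a description of vLLM's allocator or of the paper's method, and every number below (the budget and both context lengths) is hypothetical, chosen only for illustration.

```python
def max_concurrency(kv_budget_tokens: int, reserved_tokens_per_request: int) -> int:
    """Requests that fit when each one reserves a fixed number of KV-cache tokens."""
    return kv_budget_tokens // reserved_tokens_per_request


# Hypothetical values, not taken from the paper.
KV_BUDGET_TOKENS = 262_144   # assumed KV-cache capacity of one instance, in tokens
LONG_CONTEXT = 32_768        # assumed worst-case context length the instance is sized for
SHORT_CONTEXT = 4_096        # assumed limit that would suffice for short requests

# Concurrency when every slot is sized for the worst case.
long_slots = max_concurrency(KV_BUDGET_TOKENS, LONG_CONTEXT)

# Concurrency when slots are sized for short requests only.
short_slots = max_concurrency(KV_BUDGET_TOKENS, SHORT_CONTEXT)

# How much concurrency is given up by sizing for the worst case.
ratio = short_slots / long_slots

print(f"long-context sizing:  {long_slots} concurrent requests")
print(f"short-context sizing: {short_slots} concurrent requests")
print(f"concurrency ratio:    {ratio:.1f}x")
```

Under these assumed numbers, the same memory holds several times more short requests than worst-case-sized slots, which is the kind of gap the abstract attributes to configuration-traffic mismatch.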