AI RESEARCH
Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference
arXiv CS.AI
arXiv:2604.09613v1 Announce Type: cross
Production vLLM fleets provision every instance for worst-case context length. This wastes 4-8x concurrency on the 80-95% of requests that are short, and at the same time triggers KV-cache failures: OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch.
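The mismatch described above can be illustrated with a small sketch. It is not the paper's algorithm, which this abstract does not spell out. It only assumes that an instance's KV cache holds a fixed number of tokens, so the number of worst-case requests it can serve concurrently is that capacity divided by the configured maximum context length. The `Pool` class, the `route` function, the pool names, and all numbers below are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical pool of identically configured inference instances."""
    name: str
    max_context_tokens: int  # context limit each instance is configured for
    kv_cache_tokens: int     # assumed KV-cache capacity per instance, in tokens

    def worst_case_concurrency(self) -> int:
        # Requests that fit at once if each one reserves the full context limit.
        return self.kv_cache_tokens // self.max_context_tokens


def route(prompt_tokens: int, max_new_tokens: int, pools: list[Pool]) -> Pool:
    """Send a request to the smallest-context pool that can hold its token budget."""
    budget = prompt_tokens + max_new_tokens
    for pool in sorted(pools, key=lambda p: p.max_context_tokens):
        if budget <= pool.max_context_tokens:
            return pool
    raise ValueError(f"token budget {budget} exceeds every pool's context limit")


# Illustrative numbers only: same KV-cache capacity, different context limits.
SHORT = Pool("short", max_context_tokens=8_192, kv_cache_tokens=262_144)
LONG = Pool("long", max_context_tokens=65_536, kv_cache_tokens=262_144)
POOLS = [LONG, SHORT]

if __name__ == "__main__":
    print(SHORT.worst_case_concurrency(), LONG.worst_case_concurrency())
    print(route(1_000, 500, POOLS).name)
    print(route(30_000, 4_000, POOLS).name)
```

With these made-up figures, an instance sized for 65,536-token contexts serves 4 worst-case requests at a time, while the same hardware sized for 8,192-token contexts serves 32. That 8x gap is the kind of concurrency loss the abstract attributes to provisioning every instance for the longest request.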