Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
arXiv CS.LG
arXiv:2605.06046v1 Announce Type: new Auto-regressive token generation in large language models is memory-bound because each decoding step must "attend to" the key and value tensors (the KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together and maximizing batch size subject to GPU memory constraints.
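To make the memory constraint concrete, here is a minimal back-of-the-envelope sketch of how KV cache size limits batch size. It is not taken from the paper: the function names, the model dimensions, and the 40 GiB budget are hypothetical, and it assumes every request holds its own full-length cache with no prefix sharing.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored per token (hypothetical helper).

    The factor 2 covers one key and one value tensor per layer;
    bytes_per_elem=2 assumes 16-bit (fp16/bf16) storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_size(free_bytes, seq_len, per_token_bytes):
    """Largest number of requests whose KV caches fit in free_bytes.

    Assumes every request uses seq_len tokens and shares nothing
    with the other requests in the batch.
    """
    return free_bytes // (seq_len * per_token_bytes)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Hypothetical budget: 40 GiB left for the KV cache, 4096 tokens per request.
batch = max_batch_size(free_bytes=40 * 2**30, seq_len=4096,
                       per_token_bytes=per_token)

print(per_token)  # 524288 bytes, i.e. 512 KiB per token
print(batch)      # 20 requests
```

Under these assumed numbers each request needs 2 GiB of cache, so the batch is capped at 20 requests by memory alone, independent of available compute.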