AI RESEARCH
Regulating Branch Parallelism in LLM Serving
arXiv CS.AI
arXiv:2605.06914v1 Announce Type: cross Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place.
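
The tension between eager admission and fixed caps can be illustrated with a toy model. The sketch below is not the paper's method: it assumes a linear decode-step cost (a fixed base plus a per-sequence term), and the function names, constants, and workload sizes are all hypothetical, chosen only to make the trade-off visible.

```python
import math


def step_latency_ms(active_seqs, base_ms=10.0, per_seq_ms=0.5):
    """Toy cost model (assumption): a decode step costs a fixed base
    plus a linear term per active sequence in the shared batch."""
    return base_ms + per_seq_ms * active_seqs


def admitted(branches, cap=None):
    """Eager admission (cap=None) runs every branch at once;
    a fixed cap runs at most `cap` branches concurrently."""
    return branches if cap is None else min(branches, cap)


def compare(serial_seqs, branches, cap=None):
    """Return the shared step latency seen by every co-batched request
    and the number of waves needed to start all branches."""
    width = admitted(branches, cap)
    waves = math.ceil(branches / width) if width else 0
    return {
        "step_ms": step_latency_ms(serial_seqs + width),
        "waves": waves,
    }


# Hypothetical workload: 8 requests in serial stages share the batch
# with one request that exposes 32 independent branches.
eager = compare(serial_seqs=8, branches=32)
capped = compare(serial_seqs=8, branches=32, cap=4)
print(eager)
print(capped)
```

Under these assumed numbers, eager admission starts all branches in one wave but nearly doubles the step latency that the serial requests experience, while the cap of 4 keeps the step short at the cost of eight waves for the branching request. Real step costs are not necessarily linear, so the figures only indicate the direction of the trade-off.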