Regulating Branch Parallelism in LLM Serving

arXiv CS.AI

arXiv:2605.06914v1 Announce Type: cross

Recent methods expose intra-request parallelism in LLM outputs, allowing independent branches to decode concurrently. Existing serving systems execute these branches eagerly or under fixed caps. We show that both are brittle: eager admission inflates the shared decode step, degrading co-batched requests in serial stages, while conservative fixed caps forgo the throughput that motivated exposing branches in the first place.
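The trade-off described above can be illustrated with a toy cost model. This is a minimal sketch and not the paper's method or measurements: the step-time function, its constants (`base_ms`, `knee`, `slope_ms`), and the names `step_time_ms` and `simulate` are all hypothetical, chosen only to show why eager admission and a small fixed cap fail in opposite directions.

```python
def step_time_ms(active, base_ms=20.0, knee=8, slope_ms=1.0):
    """Hypothetical decode-step cost: flat up to `knee` sequences, then linear.

    All constants are illustrative assumptions, not measured values.
    """
    return base_ms + slope_ms * max(0, active - knee)


def simulate(branches, tokens_per_branch, serial_requests, cap=None):
    """Return (branch_request_ms, worst_serial_token_ms) for one admission cap.

    cap=None models eager admission: every branch decodes at once.
    A finite cap runs branches in waves of at most `cap` sequences.
    Serial requests emit one token per shared step, so their per-token
    latency equals the step time of whatever batch they share.
    """
    width = branches if cap is None else min(cap, branches)
    remaining = branches
    total_ms = 0.0
    worst_step_ms = 0.0
    while remaining > 0:
        wave = min(width, remaining)
        # Every step in this wave carries the serial requests plus the wave.
        step = step_time_ms(serial_requests + wave)
        total_ms += tokens_per_branch * step
        worst_step_ms = max(worst_step_ms, step)
        remaining -= wave
    return total_ms, worst_step_ms


if __name__ == "__main__":
    # One request with 16 branches of 100 tokens, co-batched with 8 serial requests.
    for cap in (None, 4, 2):
        total, worst = simulate(16, 100, 8, cap=cap)
        label = "eager" if cap is None else f"cap={cap}"
        print(f"{label:>6}: branch request {total:.0f} ms, serial token {worst:.0f} ms")
```

Under these made-up constants, eager admission finishes the branching request in 3600 ms but raises the co-batched serial requests' per-token latency from 20 ms to 36 ms, while a cap of 2 holds serial latency at 22 ms and stretches the branching request to 17600 ms. The specific numbers are artifacts of the assumed cost curve; only the direction of the trade-off is the point.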