AI RESEARCH
An Interpretable Latency Model for Speculative Decoding in LLM Serving
arXiv cs.LG
arXiv:2605.15051v1 Announce Type: new Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed.