Test-Time Speculation

arXiv:2605.09329v1 Announce Type: cross Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and an accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, the number of draft tokens accepted by the target. Our studies show that the acceptance length of even state-of-the-art speculators, such as DFlash, EAGLE-3, and PARD, degrades with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks.
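To make the acceptance-length metric concrete, here is a minimal sketch in Python. It rests on two assumptions that the abstract does not spell out: verification is greedy (a draft token is accepted only if it exactly matches the token the target would have picked), and the acceptance length counts every token committed per target forward pass, including the one token the target always contributes itself. Under that convention a value of 1 means the draft contributed nothing, which matches the "no speedup" reading above. The function names are hypothetical and are not taken from the paper or from any of the speculators it names.

```python
from typing import Iterable, Sequence, Tuple


def tokens_per_step(draft: Sequence[int], target: Sequence[int]) -> int:
    """Tokens committed in one draft-then-verify step (greedy verification).

    `draft` holds the k tokens proposed by the draft model. `target` holds the
    k + 1 tokens the target model picks at the same positions; the extra one
    is the token that follows the last draft token.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first mismatch: every later draft token is discarded
        accepted += 1
    # The target's own token at the stopping position is always kept.
    return accepted + 1


def mean_acceptance_length(
    steps: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Average tokens committed per target forward pass over many steps."""
    counts = [tokens_per_step(d, t) for d, t in steps]
    if not counts:
        raise ValueError("need at least one step")
    return sum(counts) / len(counts)
```

Tracking this average over successive windows of a long generation, rather than over the whole response, is one way to observe the kind of degradation with output length that the abstract describes.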