Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

arXiv CS.CL

arXiv:2605.14978v1

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive.
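To make the prefix-sensitivity point concrete, here is a minimal sketch in Python. It assumes greedy verification, in which a draft token is accepted only if it equals the target model's own choice at that position; this is a simplification for illustration and not necessarily the verification rule used in the paper. The function names and token IDs are hypothetical.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target's choices.

    Assumes greedy (exact-match) verification, a simplifying assumption.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch invalidates the rest of the window
        accepted += 1
    return accepted


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of matching positions, regardless of where mismatches fall."""
    if not target:
        return 0.0
    matches = sum(d == t for d, t in zip(draft, target))
    return matches / len(target)


# Hypothetical token IDs: both drafts get 4 of 5 positions right.
target_window = [11, 12, 13, 14, 15]
early_miss = [99, 12, 13, 14, 15]  # wrong only at position 0
late_miss = [11, 12, 13, 14, 99]   # wrong only at position 4

for name, draft in [("early_miss", early_miss), ("late_miss", late_miss)]:
    acc = token_accuracy(draft, target_window)
    kept = accepted_prefix_length(draft, target_window)
    print(f"{name}: token accuracy={acc:.1f}, accepted tokens={kept}")
```

Both drafts have the same token-level accuracy (0.8), yet the early miss yields zero accepted tokens while the late miss yields four. A token-level objective cannot distinguish the two cases, which is the mismatch between training signal and speculative utility that the abstract describes.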