AI RESEARCH

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

arXiv cs.CL

arXiv:2605.20022v1 Announce Type: new

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification: the target model idles while the drafter proposes, and the drafter idles while the target verifies. Repeated exchange of intermediate states between the two further increases memory access overhead.
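To make the conventional sequential scheme concrete, here is a minimal sketch of one draft-then-verify loop under greedy decoding. It illustrates the baseline described above, not FlexDraft itself. The names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, and both models are stand-in functions that map a token list to the next token. In a real system the verification loop would be a single batched forward pass of the target model.

```python
from typing import Callable, List

Token = int
# Assumption: each "model" is a greedy next-token function over a token list.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Run one sequential draft-then-verify round and return the new tokens."""
    # Drafting phase: the drafter proposes k tokens autoregressively.
    draft: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verification phase: the target checks every drafted position.
    # Written as a loop here; a real target scores all positions in parallel.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft token was accepted, so the target yields one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, max_new_tokens: int,
             k: int = 4) -> List[Token]:
    """Repeat draft-then-verify rounds until max_new_tokens are produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prompt) + max_new_tokens]


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical target: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical drafter: agrees with the target except right after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Because every emitted token is either confirmed or supplied by the target, the output of `generate` matches plain greedy decoding with the target alone, which is the sense in which there is no quality degradation. The two phases run strictly one after the other, so each model waits on the other. That is the mutual waiting the abstract refers to.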