AI RESEARCH

LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

arXiv cs.CL

arXiv:2507.01449v3 Announce Type: replace

Speculative decoding (SD), in which a small draft model proposes draft tokens in advance and the target model then validates them in parallel, has emerged as a promising technique for accelerating LLM inference. Many efforts to improve SD eliminate the draft model altogether and instead generate draft tokens by retrieval, which further reduces the drafting overhead and makes SD considerably easier to deploy and apply.
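To make the draft-then-verify loop concrete, here is a minimal sketch of retrieval-based drafting with greedy verification. It is an illustration of the general idea described above, not LogitSpec's method: the function names `retrieve_draft` and `verify` are hypothetical, the retrieval rule (copy what followed the most recent earlier occurrence of the last n-gram) is just one simple choice, and `target_next` is an assumed stand-in for the target model's greedy next-token prediction. A real system would score all draft positions in a single parallel forward pass; the loop below only simulates that outcome.

```python
from typing import Callable, List


def retrieve_draft(context: List[int], ngram: int = 2, max_draft: int = 4) -> List[int]:
    """Propose draft tokens without a draft model (hypothetical helper).

    Finds the most recent earlier occurrence of the last `ngram` tokens
    in `context` and copies up to `max_draft` tokens that followed it.
    """
    if len(context) <= ngram:
        return []
    key = context[-ngram:]
    # Scan backwards, skipping the trailing n-gram itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == key:
            return context[start + ngram:start + ngram + max_draft]
    return []


def verify(
    context: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification (assumed scheme): keep the longest draft prefix
    that matches the target's own predictions, then append one target token.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes a bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Toy stand-in for a target model: cycles 1, 2, 3, 4, 1, 2, ..."""
    return (ctx[-1] % 4) + 1


context = [1, 2, 3, 4, 1, 2]
draft = retrieve_draft(context)             # [3, 4, 1, 2]
step = verify(context, draft, toy_target)   # [3, 4, 1, 2, 3]
print(draft, step)
```

In this toy run all four retrieved tokens are accepted and the target adds one more, so a single verification step yields five tokens instead of one. When the draft is wrong, the step still produces at least one correct token, which is why the output matches plain greedy decoding under this scheme.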