AI RESEARCH
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
arXiv cs.LG
arXiv:2605.10453v1
Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure: a lightweight draft model proposes tokens, which the target model then verifies in a single forward pass. Although the drafter network in modern architectures is small, its LM-head still projects onto a large vocabulary, making it one of the major computational bottlenecks. Prior work has predominantly addressed this issue via static or dynamic vocabulary truncation.
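The draft-then-verify procedure described above can be illustrated with a minimal sketch. It assumes greedy decoding and uses toy stand-in models; the names `speculative_step`, `draft_next`, and `target_next` are hypothetical and not taken from the paper. It shows only the generic two-step loop, not SlimSpec's LM-head design.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a greedy next-token predictor over a token context.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(
    prefix: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    k: int,
) -> List[Token]:
    """Run one greedy draft-then-verify step and return the newly emitted tokens."""
    # Step 1: the drafter proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Step 2: the target checks every proposed position. A real system scores
    # all k positions in a single batched forward pass; this loop is a stand-in.
    emitted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # Every draft token was accepted, so the target contributes one extra token.
    emitted.append(target_next(ctx))
    return emitted


# Toy models (assumptions for illustration): the target counts upward modulo 10,
# and the drafter agrees except right after token 3.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], draft_next, target_next, k=4))
    print(speculative_step([5], draft_next, target_next, k=3))
```

In this sketch every step emits at least one token that the target itself would have produced, so the output matches plain greedy decoding with the target. The speedup in a real system comes from verifying several draft tokens per target forward pass.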