AI RESEARCH

Faster LLM Inference via Sequential Monte Carlo

arXiv cs.LG

arXiv:2604.15672v1 Announce Type: new Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to reweight them. To this end, we