RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

ArXi:2604.14885v1 Announce Type: cross Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing