AI RESEARCH
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios
arXiv CS.AI
arXiv:2604.09603v1 Announce Type: cross Speculative Decoding promises to accelerate the inference of Large Language Models, yet its efficacy often degrades in production-grade serving. Existing evaluations typically overlook the compute-bound nature of high-concurrency regimes, where verification compute becomes the dominant bottleneck. Consequently, prior methods face a dilemma: static trees incur massive verification waste, while dynamic trees suffer from cumulative misjudgments and kernel incompatibility. To bridge this gap, we.