AI RESEARCH

CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

arXiv CS.AI

ArXi:2605.11186v1 Announce Type: cross Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call.