EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs

arXiv:2603.08088v1. Autoregressive decoding remains a primary bottleneck in large language model (LLM) serving, motivating speculative decoding methods that reduce expensive teacher-model invocations by verifying multiple candidate tokens per step. Tree-structured speculation further increases parallelism, but it is often brittle when ported across heterogeneous backends and accelerator stacks, where attention masking, KV-cache layouts, and indexing semantics are not interchangeable.
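
To make the attention-masking concern concrete, the following is a minimal sketch of how a tree of draft tokens is commonly turned into an ancestor-only attention mask, so that each candidate attends to itself and to the tokens on its own root-to-node path. It is an illustration under stated assumptions, not the paper's implementation: the function name `tree_attention_mask`, the parent-pointer encoding, and the requirement that parents precede children are all hypothetical choices made for this example.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Build a boolean mask where mask[i][j] is True iff node i may attend to node j.

    Assumption (hypothetical encoding): parents[i] is the index of node i's
    parent, or -1 for a root, and every parent index is smaller than its
    child's index (nodes are listed in topological order).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        if p >= i:
            raise ValueError("parents must be listed before their children")
        # Every node attends to itself.
        mask[i][i] = True
        if p >= 0:
            # Inherit the parent's visible set, i.e. all ancestors of node i.
            for j in range(n):
                if mask[p][j]:
                    mask[i][j] = True
    return mask


# Example tree: node 0 is the root, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
mask = tree_attention_mask([-1, 0, 0, 1])
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
```

In this example node 3 sees nodes 0, 1, and itself but not its sibling branch (node 2). A backend that expects a plain causal (lower-triangular) mask, or that indexes the KV cache by contiguous position, would treat node 2 as visible to node 3, which is one way a tree-speculation port can silently go wrong.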