A Persistent-State Dataflow Accelerator for Memory-Bound Linear Attention Decode on FPGA
arXiv CS.LG
arXiv:2603.05931v1 Announce Type: cross Gated DeltaNet (GDN) is a linear attention mechanism that replaces the growing KV cache with a fixed-size recurrent state. Hybrid LLMs such as Qwen3-Next use GDN for 75% of their layers and achieve accuracy competitive with attention-only models. However, at batch size 1, GDN decode is memory-bound on GPUs because the full recurrent state must be round-tripped through HBM for every token.
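To make the fixed-size state concrete, the following is a minimal pure-Python sketch of one decode step for a single head. It assumes the gated delta rule as it is commonly written for Gated DeltaNet, S_t = alpha_t * S_{t-1} * (I - beta_t * k_t * k_t^T) + beta_t * v_t * k_t^T with output o_t = S_t * q_t; the abstract itself does not state the update, so treat the formula, the function names, and the dimensions in the usage note as illustrative assumptions rather than the paper's implementation.

```python
def gdn_decode_step(S, q, k, v, alpha, beta):
    """One gated delta-rule decode step for a single head (illustrative sketch).

    S is a d_v x d_k state stored as a list of rows; q and k have length d_k,
    v has length d_v; alpha (decay gate) and beta (write strength) are scalars.
    Returns (S_new, o). S_new has the same shape as S, so the state never grows.
    """
    d_k = len(k)
    S_new, o = [], []
    for i, row in enumerate(S):
        # (S k)_i: the value component the old state retrieves for key k
        sk = sum(row[j] * k[j] for j in range(d_k))
        # Erase the old association along k, apply the decay gate, write the new value
        new_row = [
            alpha * (row[j] - beta * sk * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ]
        S_new.append(new_row)
        # o_i = (S_new q)_i
        o.append(sum(new_row[j] * q[j] for j in range(d_k)))
    return S_new, o


def state_round_trip_bytes(d_k, d_v, num_heads, bytes_per_elem):
    """Bytes moved per token per layer if the whole state is read once and
    written once (hypothetical helper; ignores q, k, v and weight traffic)."""
    return 2 * num_heads * d_k * d_v * bytes_per_elem
```

Every element of S is read and rewritten on each step while only a handful of arithmetic operations are performed per element, which is why the step is limited by memory bandwidth when the state lives off-chip. For assumed dimensions of 32 heads with d_k = d_v = 128 in 4-byte floats, state_round_trip_bytes(128, 128, 32, 4) gives 4,194,304 bytes of state traffic per token per layer.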