AI RESEARCH

KVBuffer: IO-aware Serving for Linear Attention

arXiv cs.AI

arXiv:2605.19049v1 Announce Type: cross Linear attention has recently drawn significant interest for long-context inference because its per-token decoding cost is constant with respect to context length. However, existing serving systems typically serve linear attention recurrently, reading and updating a large linear attention state at every decoding step. Because this state is much larger than a single token's key and value, recurrent decoding incurs substantial memory traffic, which makes serving linear attention inefficient.
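
To make the memory-access argument concrete, here is a minimal sketch of one recurrent decoding step. It assumes the common unnormalized, single-head form of linear attention, with state update S_t = S_{t-1} + k_t v_t^T and output o_t = S_t^T q_t; the abstract does not say which variant is served, and this is not the KVBuffer method itself. The function names and the head dimension of 128 are hypothetical, chosen only for illustration.

```python
# Sketch of recurrent linear-attention decoding (assumed unnormalized, single head).
# Assumption: S_t = S_{t-1} + k_t v_t^T and o_t = S_t^T q_t.


def decode_step(state, q, k, v):
    """Update the d_k x d_v state in place and return the output vector."""
    d_k, d_v = len(k), len(v)
    for i in range(d_k):
        row = state[i]
        for j in range(d_v):
            row[j] += k[i] * v[j]  # rank-1 update touches every state entry
    # Output: o[j] = sum_i q[i] * S[i][j]
    return [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]


def elements_per_step(d_k, d_v):
    """Element counts per decode step: (state entries, per-token key + value)."""
    return d_k * d_v, d_k + d_v


# Two decode steps on a tiny 2 x 2 state.
state = [[0.0, 0.0], [0.0, 0.0]]
out1 = decode_step(state, q=[1.0, 0.0], k=[1.0, 2.0], v=[3.0, 4.0])
out2 = decode_step(state, q=[0.0, 1.0], k=[1.0, 0.0], v=[1.0, 1.0])

# Hypothetical head dimension of 128 for both keys and values.
state_elems, kv_elems = elements_per_step(128, 128)
```

Under these assumptions, every step reads and writes all d_k × d_v state entries (16,384 at dimension 128), while the new token contributes only d_k + d_v key and value entries (256), which is the imbalance the abstract describes.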