AI RESEARCH
CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
arXiv cs.AI
arXiv:2604.08584v1 Announce Type: cross Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain Q&A, making attention and the KV cache the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it often struggles to maintain accuracy at high sparsity levels because of the inherent distribution shift between queries and keys. We propose Centroid-Scoring Attention (CSAttention), a