AI RESEARCH
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
arXiv CS.AI
arXiv:2604.18103v1 Announce Type: new Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixed points, making further processing redundant. To this end, we.