AI RESEARCH
Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage
arXiv CS.LG
arXiv:2601.03043v3 Announce Type: replace-cross Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work