TriangleMix: Accelerating Prefilling via Decoding-time Contribution Sparsity

arXiv:2507.21526v3 Announce Type: replace

In Large Language Models (LLMs), the cost of attention grows quadratically with input length, which makes the prefilling stage a major time bottleneck. Existing acceleration methods largely exploit attention score sparsity: they estimate which blocks have high attention scores and apply dynamic sparse attention to those blocks only.