Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

arXiv:2604.20819v1 Announce Type: new The scalability of long-context large language models is fundamentally limited by the quadratic memory cost of exact self-attention, which often leads to out-of-memory (OOM) failures on modern hardware. Existing methods reduce the memory cost to near-linear complexity, but they still assume that the full query, key, and value tensors fit in device memory. In this work, we remove this assumption by