Attn-QAT: 4-Bit Attention With Quantization-Aware Training

arXiv:2603.00040v2

Achieving reliable 4-bit attention is a prerequisite for end-to-end FP4 computation on emerging FP4-capable GPUs, yet attention remains the main obstacle due to FP4's tiny dynamic range and attention's heavy-tailed activations. This paper presents the first systematic study of 4-bit quantization-aware