FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference.

Wrote a deep dive on FlashAttention-4 (03/05/2026) that's relevant for anyone thinking about inference performance. TL;DR for inference: BF16 forward: 1,613 TFLOPs/s on B200 (71% utilization). Attention is basically at matmul speed now. 2.1-2.7x faster than Triton, up to 1.3x faster than cuDNN 9.13 vLLM 0.17.0 (released March 7) integrates FA-4. If you're on B200, it's automatic.