RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

arXiv:2602.18196v2 Announce Type: replace Structured dilated attention has an appealing inference-time efficiency knob: it reduces the attention FLOPs and the KV cache size by a factor of the dilation size D while preserving long-range connectivity. However, we find that it has a persistent failure mode: sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We