AI RESEARCH

Dispatch-Aware Ragged Attention for Pruned Vision Transformers

arXiv CS.AI

arXiv:2604.15408v1 (announce type: cross). Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet when pruned sequences are executed with state-of-the-art variable-length attention APIs -- including FlashAttention-2's varlen and PyTorch's NestedTensor SDPA -- the wall-clock attention latency does not scale accordingly.
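
To make the "quadratic reduction" concrete, here is a minimal back-of-the-envelope sketch. It counts only the two attention matrix multiplies (QK^T and the attention-weighted sum over V) and ignores softmax and projections. The function name `attention_flops`, the ViT-B/16-like shape (196 patch tokens, 12 heads, head dimension 64), and the 50% keep rate are illustrative assumptions, not figures from the paper.

```python
def attention_flops(num_tokens: int, head_dim: int, num_heads: int) -> int:
    """Approximate FLOPs for one self-attention layer's two matmuls.

    QK^T and (softmax scores) @ V each cost about 2 * N^2 * d FLOPs per head,
    so the total is 4 * heads * N^2 * d. This is a rough count, not a profile.
    """
    return 4 * num_heads * num_tokens * num_tokens * head_dim


# Assumed shapes for illustration only (not taken from the paper).
full = attention_flops(num_tokens=196, head_dim=64, num_heads=12)
pruned = attention_flops(num_tokens=98, head_dim=64, num_heads=12)  # keep 50% of tokens

ratio = pruned / full
print(f"full={full:,} pruned={pruned:,} ratio={ratio}")
```

Under these assumptions, halving the token count cuts the attention FLOPs to a quarter. The abstract's point is that measured latency with variable-length attention kernels does not follow this ideal curve.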