Elastic Attention Cores for Scalable Vision Transformers [R]

Wanted to share our latest paper on an alternative building block for Vision Transformers.

[Figure: Illustration of our model's accuracy and dense features]

Traditional ViTs use dense $O(N^2)$ self-attention over $N$ tokens, which becomes costly at higher resolutions. In this work, we propose an alternative backbone with a core-periphery block-sparse attention structure that scales as $O(2NC + C^2)$ for $C$ core tokens. We further train it with nested dropout, which lets the inference cost be adjusted elastically at test time.
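For anyone who wants to see where the cost figure comes from, here is a rough sketch (not the paper's implementation) of a core-periphery attention pattern in plain Python. It assumes the C core tokens are placed before the N patch tokens, that core tokens attend to everything, and that patch tokens attend only to the core; counting the allowed query-key pairs then gives 2NC + C^2. The function names and the uniform sampling in `sample_active_cores` are illustrative assumptions, and the actual nested dropout schedule may differ.

```python
import random


def core_periphery_mask(num_patches, num_core):
    """Return mask[q][k] == True where query q may attend to key k.

    Assumed token order: [core_0 .. core_{C-1}, patch_0 .. patch_{N-1}].
    A pair is allowed if the query or the key is a core token, so
    patch-to-patch attention (the N^2 block) is dropped.
    """
    total = num_core + num_patches
    return [
        [q < num_core or k < num_core for k in range(total)]
        for q in range(total)
    ]


def attended_pairs(num_patches, num_core):
    """Count allowed query-key pairs; equals 2*N*C + C^2 for this mask."""
    mask = core_periphery_mask(num_patches, num_core)
    return sum(sum(row) for row in mask)


def sample_active_cores(max_core, rng):
    """Nested-dropout-style sampling (assumed uniform here).

    Draw k in 1..max_core and keep only core tokens 0..k-1, so every
    prefix of the core tokens is trained to work on its own.
    """
    return rng.randint(1, max_core)
```

With N = 196 patches and C = 16 core tokens, `attended_pairs(196, 16)` gives 6528 pairs versus 196^2 = 38416 for dense attention over the patches alone. At test time, passing a smaller `num_core` (a prefix of the core tokens) shrinks the count further, which is the elastic knob described above.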