Elastic Attention Cores for Scalable Vision Transformers

arXiv:2605.12491v1 Announce Type: cross Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction.
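
To make the quadratic cost concrete, the following minimal sketch counts tokens and pairwise attention interactions for a square image. The patch size of 16, the absence of a class token, and the function name `attention_cost` are illustrative assumptions, not details taken from the abstract.

```python
def attention_cost(image_side: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (num_tokens, num_pairwise_interactions) for a square image.

    Assumes non-overlapping square patches and no class token;
    patch_size=16 is an illustrative default, not a value from the paper.
    """
    if image_side % patch_size != 0:
        raise ValueError("image_side must be divisible by patch_size")
    tokens_per_side = image_side // patch_size
    num_tokens = tokens_per_side ** 2
    # All-to-all self-attention scores every token against every token.
    return num_tokens, num_tokens ** 2


if __name__ == "__main__":
    for side in (224, 448, 896):
        tokens, pairs = attention_cost(side)
        print(f"{side}x{side}: {tokens} tokens, {pairs} pairwise interactions")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the number of pairwise interactions by 16, which is the growth that makes high-resolution inputs expensive.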