Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

arXiv:2603.00518v2 Announce Type: replace Learning efficient and expressive visual representations has long been a pursuit of computer vision research. While Vision Transformers (ViTs) are gradually replacing traditional Convolutional Neural Networks (CNNs) as scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address the challenge, we