V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

arXiv:2603.14482v1

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the