DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture

arXiv:2511.17354v2 Announce Type: replace Recent advances in self-supervised visual representation learning have demonstrated the effectiveness of predictive latent-space objectives for learning transferable features. In particular, the Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns representations by predicting the latent embeddings of masked target regions from visible context. However, it predicts all target regions in parallel, in a single step, and so lacks the ability to order its predictions meaningfully.
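
The parallel latent prediction described above can be illustrated with a toy NumPy sketch. It is not the I-JEPA implementation: the linear maps `w_enc` and `w_pred`, the mean-pooled context, the additive positional queries `pos`, and the function name `parallel_latent_loss` are all hypothetical stand-ins chosen for brevity (the real model uses Vision Transformer encoders and a separate target encoder). The sketch only shows that every target embedding is predicted in one batched step from the same context, so the loss does not depend on the order of the targets.

```python
import numpy as np

def parallel_latent_loss(patches, context_idx, target_idx, w_enc, w_pred, pos):
    """Toy latent-prediction loss; all names and shapes here are illustrative assumptions."""
    # Target embeddings: encode every patch, then keep the masked target positions.
    targets = (patches @ w_enc)[target_idx]                 # (T, D)
    # Context summary: encode only the visible patches, then mean-pool (toy choice).
    context = (patches[context_idx] @ w_enc).mean(axis=0)   # (D,)
    # One query per target position; all targets are predicted in a single batched step.
    queries = context[None, :] + pos[target_idx]            # (T, D)
    preds = queries @ w_pred                                # (T, D)
    # Mean squared L2 distance between predicted and target embeddings.
    return float(np.mean(np.sum((preds - targets) ** 2, axis=1)))

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 16, 12, 8
patches = rng.normal(size=(num_patches, patch_dim))
w_enc = rng.normal(size=(patch_dim, embed_dim))
w_pred = rng.normal(size=(embed_dim, embed_dim))
pos = rng.normal(size=(num_patches, embed_dim))

context_idx = np.arange(0, 10)            # visible patches
target_idx = np.array([10, 12, 13, 15])   # masked target patches
loss = parallel_latent_loss(patches, context_idx, target_idx, w_enc, w_pred, pos)
print(loss)
```

Reversing or shuffling `target_idx` leaves the loss unchanged (up to floating-point rounding), which is the sense in which a parallel objective of this kind carries no notion of prediction order.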