Factorized Latent Dynamics for Video JEPA: An Empirical Study of Auxiliary Objectives

arXiv:2605.17165v1 Announce Type: cross Joint-Embedding Predictive Architectures (JEPA) are a promising framework for self-supervised video representation learning, yet the behavior of auxiliary objectives in small-scale Video