Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations

arXiv:2603.20327v1 Announce Type: cross Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway that generative models provide, creating a structural interpretability gap: the physical structure the encoder has learned is not available in any inspectable form.