ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

arXiv:2603.22281v1 Announce Type: cross Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility.