Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

arXiv:2603.07039v1 Announce Type: new

We present DeepEarth, a self-supervised multi-modal world model with Earth4D, a novel planetary-scale 4D space-time positional encoder. Earth4D extends 3D multi-resolution hash encoding to include time, scaling efficiently across the planet over centuries with sub-meter, sub-second precision. Multi-modal encoders (e.g. vision-language models) are fused with Earth4D embeddings and trained via masked reconstruction. We demonstrate Earth4D's expressive power by achieving state-of-the-art performance on an ecological forecasting benchmark.
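The following sketch may help make "extending 3D multi-resolution hash encoding to include time" concrete. It shows the general technique (Instant-NGP-style hash grids) with a fourth axis added. It is not Earth4D's implementation. The class name `HashEncoding4D`, every parameter value, the fourth hashing prime, and the assumption that (x, y, z, t) arrive already normalized to [0, 1]^4 are all hypothetical. Here each level hashes its 2^4 = 16 cell corners into a small feature table and blends them by quadrilinear interpolation. The coarse-level dense grids and GPU kernels used in practice are omitted.

```python
import math
import random
from itertools import product

# Spatial-hash primes in the style of Instant-NGP; the fourth (time) prime
# is an illustrative assumption, not a value taken from Earth4D.
PRIMES = (1, 2654435761, 805459861, 3674653429)


class HashEncoding4D:
    """Hypothetical sketch of a 4D (x, y, z, t) multi-resolution hash encoding.

    Assumes inputs are already normalized to [0, 1]^4; how planetary
    coordinates and timestamps are mapped into that range is not shown.
    """

    def __init__(self, n_levels=8, n_features=2, log2_table_size=12,
                 base_res=4, max_res=512, seed=0):
        self.n_levels = n_levels
        self.n_features = n_features
        self.table_size = 2 ** log2_table_size
        # Geometric growth factor between consecutive level resolutions.
        growth = math.exp((math.log(max_res) - math.log(base_res)) / (n_levels - 1))
        self.resolutions = [math.floor(base_res * growth ** l) for l in range(n_levels)]
        rng = random.Random(seed)
        # One feature table per level; in training these would be learned.
        self.tables = [
            [[rng.uniform(-1e-4, 1e-4) for _ in range(n_features)]
             for _ in range(self.table_size)]
            for _ in range(n_levels)
        ]

    def _hash(self, corner):
        # XOR of (integer coordinate * prime), folded into the table size.
        h = 0
        for c, p in zip(corner, PRIMES):
            h ^= c * p
        return h % self.table_size

    def encode(self, point):
        """Return the concatenated n_levels * n_features embedding of one point."""
        assert len(point) == 4
        out = []
        for level, res in enumerate(self.resolutions):
            scaled = [coord * res for coord in point]
            base = [math.floor(s) for s in scaled]
            frac = [s - b for s, b in zip(scaled, base)]
            feat = [0.0] * self.n_features
            # Quadrilinear interpolation over the 16 corners of the 4D cell.
            for offset in product((0, 1), repeat=4):
                weight = 1.0
                for o, f in zip(offset, frac):
                    weight *= f if o else (1.0 - f)
                corner = [b + o for b, o in zip(base, offset)]
                entry = self.tables[level][self._hash(corner)]
                for i in range(self.n_features):
                    feat[i] += weight * entry[i]
            out.extend(feat)
        return out
```

For example, `HashEncoding4D().encode((0.25, 0.5, 0.75, 0.0))` returns a 16-dimensional vector (8 levels × 2 features). In a model like the one described, such a vector would be fused with modality embeddings. Hashing keeps memory bounded by the table size rather than the grid resolution. This is what lets fine levels cover very large space-time extents, at the cost of occasional hash collisions that training must resolve.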