Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

arXiv:2605.03650v1 Announce Type: cross The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction.
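To make the "discrete correspondence" view concrete, the following is a minimal sketch of how slots in consecutive frames could be aligned by matching features rather than by predicting future slots. It is not the paper's method: the per-slot feature vectors, the cosine-similarity score, the one-to-one assignment objective, and the names `cosine` and `match_slots` are all assumptions made for illustration. The search is exhaustive, so it only suits a handful of slots; for larger slot counts a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_slots(prev_feats, curr_feats):
    """Return a list `perm` where perm[i] is the index of the current-frame
    slot matched to previous-frame slot i.

    Hypothetical helper: maximizes total cosine similarity over all
    one-to-one assignments by brute force (O(k!), small k only).
    """
    k = len(prev_feats)
    assert len(curr_feats) == k, "sketch assumes the same slot count per frame"
    # Pairwise similarity between every previous slot and every current slot.
    sim = [[cosine(p, c) for c in curr_feats] for p in prev_feats]
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(k)),
    )
    return list(best)


# Toy data (made up): three per-slot feature vectors in frame t ...
prev_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# ... and the same objects in frame t+1, slightly perturbed and in shuffled slot order.
curr_feats = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]]

assignment = match_slots(prev_feats, curr_feats)
print(assignment)  # → [2, 0, 1]
```

Reordering the current-frame slots by `assignment` restores a consistent object identity per slot index across the two frames, with no learned dynamics module involved in this toy setting.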