VINO: Video-driven Invariance for Non-contextual Objects via Structural Prior Guided De-contextualization

arXiv:2603.07222v1

Self-supervised learning (SSL) has made rapid progress, yet learned features often over-rely on contextual shortcuts (background textures and co-occurrence statistics). While video provides rich temporal variation, dense in-the-wild streams with strong ego-motion create a co-occurrence trap: foreground objects and background context move coherently, encouraging representations to collapse into scene encoders.