SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

arXiv:2603.22057v1 Announce Type: new Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and, therefore, often fail to capture 3D spatial relationships between objects and backgrounds in the real world, cons