DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval

arXiv:2604.19432v1 Announce Type: new

Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging CLIP's semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grained detail prompted us to explore the potential of a recent self-supervised encoder