AI RESEARCH

MEDiC: Multi-objective Exploration of Distillation from CLIP

arXiv CS.CV

ArXi:2603.29009v1 Announce Type: new Masked image modeling (MIM) methods typically operate in either raw pixel space (reconstructing masked patches) or latent feature space (aligning with a pre-trained teacher). We present MEDiC (Multi-objective Exploration of Distillation from CLIP), a framework that combines both spaces in a single pipeline through three complementary objectives: patch-level token distillation from a frozen CLIP encoder, global CLS alignment, and pixel reconstruction via a lightweight decoder.