CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation

arXiv:2603.25383v1 Announce Type: new CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher.