From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

arXiv:2603.10877v1 Announce Type: new Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without a significant drop in performance. Traditional KD techniques assume that the teacher (source) and student (target) models share the same modality. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-