LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

arXiv:2605.10641v1 Announce Type: cross Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network.
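
To make the Teacher-to-Student transfer concrete, here is a minimal sketch of the generic soft-label distillation loss: a temperature-scaled KL divergence between the Teacher's and the Student's output distributions. This is the textbook formulation, not the cascaded method this paper proposes. The function names, the temperature value, and the toy logits are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T**2 factor is the usual rescaling that keeps gradient magnitudes
    comparable across temperatures. Hypothetical helper, for illustration only.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl


# Toy logits over three answer candidates (made-up values).
teacher = [2.0, 0.5, -1.0]
student_far = [0.0, 0.0, 0.0]     # uninformed Student: uniform prediction
student_near = [1.5, 0.5, -0.5]   # Student that roughly tracks the Teacher

loss_far = distillation_loss(student_far, teacher)
loss_near = distillation_loss(student_near, teacher)
print(loss_far > loss_near)
```

In training, a term like this is typically added to the ordinary task loss, so the Student learns from the Teacher's full output distribution as well as from the ground-truth labels.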