AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching

arXiv:2510.07842v2 Announce Type: replace-cross

Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods face a dilemma: off-policy distillation provides high-quality supervision but suffers from exposure bias (