AMiD: Knowledge Distillation for LLMs with $\alpha$-mixture Assistant Distribution

arXiv:2510.15982v3 Announce Type: replace Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and