Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

arXiv:2603.22056v1

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but their size and resource demands make them costly to deploy. Knowledge Distillation (KD) addresses this by