Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation

arXiv:2602.20816v2 Announce Type: replace-cross The core learning signal used in language model distillation is the standard Kullback-Leibler (KL) divergence between the student and teacher distributions. Traditional KL divergence tends to be dominated by the next tokens with the highest probabilities, i.e., the teacher's modes, thereby diminishing the influence of less probable yet potentially informative components of the output distribution.
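To make the dominance claim concrete, the following is a minimal sketch, not the paper's method. It splits the per-token terms of the forward KL divergence, KL(teacher || student), into the contribution of the teacher's top-K tokens and the contribution of the remaining tail. The function name `kl_head_tail`, the choice of the forward direction, and the toy distributions are all assumptions made for illustration.

```python
import math


def kl_head_tail(teacher, student, k):
    """Split KL(teacher || student) into top-k (head) and tail contributions.

    Hypothetical helper for illustration only. `teacher` and `student` are
    probability lists over the same vocabulary; the head is the k tokens
    with the highest teacher probability.
    """
    if len(teacher) != len(student):
        raise ValueError("distributions must have the same length")

    # Per-token terms p * log(p / q); a token with p == 0 contributes 0.
    terms = []
    for p, q in zip(teacher, student):
        if p > 0 and q <= 0:
            raise ValueError("student must be positive wherever teacher is")
        terms.append(p * math.log(p / q) if p > 0 else 0.0)

    # Rank token indices by teacher probability, highest first.
    order = sorted(range(len(teacher)), key=lambda i: teacher[i], reverse=True)
    head = sum(terms[i] for i in order[:k])
    tail = sum(terms[i] for i in order[k:])
    return head, tail


# Toy distributions (assumed values): the teacher is sharply peaked and the
# student spreads too much mass over the low-probability tokens.
teacher = [0.90, 0.06, 0.02, 0.01, 0.01]
student = [0.80, 0.05, 0.05, 0.05, 0.05]

head, tail = kl_head_tail(teacher, student, k=1)
print(f"head (top-1) contribution: {head:.4f}")
print(f"tail contribution:         {tail:.4f}")
```

Because each term is weighted by the teacher probability, the single top token accounts for most of the magnitude in this toy case, even though the student's relative error is largest on the tail tokens. Individual terms may be negative; only their sum is guaranteed to be non-negative.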