Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

arXiv CS.AI

arXiv:2605.09253v1 Announce Type: cross While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, a token type that--as the most direct signal of student-teacher mismatch under OPD's per-token KL objective--should progressively diminish as.
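
To make the notion of a "high-loss token" concrete, the following is a minimal sketch of a per-token KL computation between a student and a teacher, followed by a selection of the positions with the largest loss. It is an illustration under stated assumptions, not the paper's implementation: the KL direction (student relative to teacher), the top-fraction selection rule, and the names `per_token_kl` and `high_loss_positions` are all hypothetical choices made here, since the abstract does not specify them.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at each sequence position.

    Assumption: the KL direction is student-to-teacher; the abstract
    only says "per-token KL objective" and does not fix the direction.
    Each argument is a list of per-position logit lists over one vocabulary.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(s_row)  # student distribution at this position
        q = softmax(t_row)  # teacher distribution at this position
        kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        losses.append(kl)
    return losses


def high_loss_positions(losses, top_fraction=0.2):
    """Return the indices of the positions with the largest per-token loss.

    Assumption: "high-loss" is taken to mean the top fraction by loss;
    the source does not state a threshold or selection rule.
    """
    k = max(1, int(len(losses) * top_fraction))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy sequence of three positions over a vocabulary of three tokens.
    # The student agrees with the teacher everywhere except position 2.
    student = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]]
    teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 3.0, 0.0]]
    losses = per_token_kl(student, teacher)
    print(losses)
    print(high_loss_positions(losses, top_fraction=0.34))
```

In this toy input the per-token loss is zero wherever the two distributions coincide and large only at the position where the student and teacher prefer different tokens, which is the kind of student-teacher mismatch signal the abstract describes.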