The Signal is in the Steps: Local Scoring for Reasoning Data Selection

arXiv:2510.03988v2 Announce Type: replace-cross Distilling long-form reasoning from teacher models into smaller students requires selecting which candidate solutions to train on. Recent work argues that one should select the responses to which the student model assigns the highest probability, i.e., solutions that are "natural" to the student. However, we find that this approach works when candidates come from a single teacher but fails when scaling to long reasoning traces from multiple diverse teachers.
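
As an illustration of the likelihood-based selection rule described above, the following minimal sketch picks the candidate solution with the highest student-assigned score. It assumes the per-token log-probabilities under the student have already been computed, and it assumes length normalization (mean log-probability per token); the abstract only says "highest probability", so the normalization, the function names `mean_logprob` and `select_most_natural`, and the toy numbers are all hypothetical.

```python
from typing import Dict, List


def mean_logprob(token_logprobs: List[float]) -> float:
    """Length-normalized score: average student log-probability per token.

    Assumption: the abstract does not specify a normalization; the mean is
    used here so that long and short traces are comparable.
    """
    if not token_logprobs:
        raise ValueError("candidate has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def select_most_natural(candidates: Dict[str, List[float]]) -> str:
    """Return the key of the candidate the student scores highest."""
    return max(candidates, key=lambda name: mean_logprob(candidates[name]))


# Toy per-token log-probabilities (hypothetical values, not from the paper).
candidates = {
    "teacher_a": [-0.2, -0.4, -0.3],
    "teacher_b": [-1.0, -0.1, -0.1, -2.0],
}

best = select_most_natural(candidates)
print(best)
```

This scores each response with a single global number, which is the kind of whole-response selection the abstract reports as breaking down once traces come from multiple diverse teachers.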