Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

arXiv:2605.11609v1 Announce Type: cross On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere.