PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

arXiv:2604.26573v1 Announce Type: new Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit; supervised fine-tuning and distillation provide dense targets but often train on fixed trajectories or rely on stronger teachers.