A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning

arXiv:2510.18814v2 Announce Type: replace-cross Can language models improve their reasoning performance without external rewards, using only their own sampled responses for