ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

arXiv:2605.15224v1

Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when the critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve the quality of its feedback over time, which limits the potential for iterative self-improvement.