Step-GRPO: Internalizing Dynamic Early Exit for Efficient Reasoning

arXiv:2604.16890v1 Announce Type: new Large reasoning models that use long chain-of-thought excel at problem-solving yet waste compute on redundant checks. Curbing this overthinking is hard