Stabilizing Efficient Reasoning with Step-Level Advantage Selection

arXiv:2604.24003v1 Announce Type: cross

Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long, verbose reasoning traces. Recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, but many approaches are post-trained under a much shorter context window than base-model