STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

arXiv:2605.13165v1 Announce Type: new

Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control.