Selective Off-Policy Reference Tuning with Plan Guidance

ArXi:2605.11505v1 Announce Type: new Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation.