Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

arXiv:2605.16302v1 Announce Type: new Reinforcement learning for multi-step reasoning with large language models (LLMs) often relies on sparse terminal rewards, which makes credit assignment difficult: the final feedback is propagated evenly across all intermediate decisions. This results in high gradient variance, unstable