Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

arXiv:2603.21563v1 Announce Type: new

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We