Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

arXiv:2605.20061v1 Announce Type: new

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. In partially observable environments, however, incomplete observations cause an agent's beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions; together, these effects make temporal credit assignment harder.