Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

ArXi:2604.11611v1 Announce Type: cross To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against the environmental feedbacks. Empirically, MISE enables an agent to learn autonomously from dense internal rewards supplementing sparse extrinsic signals.