Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning

arXiv:2601.10306v2 Announce Type: replace-cross While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by the sparsity of outcome rewards. Because such rewards score only the final answer, they fail to penalize ungrounded "lucky guesses," leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization