AI RESEARCH

Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

arXiv cs.LG

arXiv:2604.01597v1. Traditional RL algorithms such as Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, on the assumption that every generated episode provides a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down