AI RESEARCH
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
arXiv cs.LG
arXiv:2605.14220v1 Announce Type: new Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing
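To make the mismatch described above concrete, here is a minimal sketch of how one might compare the per-token log-probabilities that the rollout engine and the trainer assign to the same sampled sequence. The function name `mismatch_report`, its dictionary keys, and the sample numbers are hypothetical and are not taken from the paper; the sketch assumes both stages can export one log-probability per generated token.

```python
import math


def mismatch_report(rollout_logprobs, trainer_logprobs):
    """Summarize how far two per-token log-prob lists disagree.

    Hypothetical helper: both lists are assumed to hold the log-probability
    of the same sampled tokens, in the same order, under the same weights.
    """
    if len(rollout_logprobs) != len(trainer_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    if not rollout_logprobs:
        raise ValueError("need at least one token")

    # Per-token difference: trainer log-prob minus rollout log-prob.
    diffs = [t - r for r, t in zip(rollout_logprobs, trainer_logprobs)]
    abs_diffs = [abs(d) for d in diffs]

    # Summing per-token differences gives the sequence-level log ratio.
    seq_log_ratio = math.fsum(diffs)

    return {
        "max_abs_token_diff": max(abs_diffs),
        "mean_abs_token_diff": math.fsum(abs_diffs) / len(abs_diffs),
        "seq_log_ratio": seq_log_ratio,
        # exp of the log ratio is the trainer/rollout probability ratio.
        "seq_prob_ratio": math.exp(seq_log_ratio),
    }


# Illustrative numbers only, not measurements from the paper.
rollout = [-0.10, -2.30, -0.05]
trainer = [-0.10, -2.25, -0.07]
report = mismatch_report(rollout, trainer)
print(report)
```

If the two stages agreed exactly, every per-token difference would be zero and `seq_prob_ratio` would be 1.0; any other value indicates that the two implementations score the same sequence differently.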