Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

arXiv:2605.14220v1

Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences between the stages can cause them to assign different probabilities to the same sequence under the same model weights, inducing