vLLM V0 to V1: Correctness Before Corrections in RL

Migration Objective Failure Modes V1 Backend Fixes Logprob Semantics Runtime Defaults Inflight Weight Updates The Remaining Gap: fp32 lm_head Ablations Why We Fixed Backend Correctness First PipelineRL uses vLLM as the inference engine for rollout generation. The inference engine samples tokens and returns token logprobs; the trainer uses those logprobs to compute policy ratios, KL, clip rate, entropy, and reward. Any discrepancy in how those logprobs are computed can change the.