Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

arXiv:2604.22229v1 Announce Type: cross

One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint.
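
To make the coupling described above concrete, here is a minimal sketch of an objective in which a single student output carries both terms. It is an illustration under stated assumptions, not the formulation used in the paper: the function name `coupled_one_step_loss`, the squared-distance penalty, and the weight `alpha` are all hypothetical choices made for this example.

```python
from typing import Sequence


def coupled_one_step_loss(
    q_values: Sequence[float],
    student_actions: Sequence[Sequence[float]],
    teacher_actions: Sequence[Sequence[float]],
    alpha: float = 1.0,
) -> float:
    """Hypothetical coupled objective for a one-step student actor.

    q_values[i] is the critic's value for student_actions[i];
    teacher_actions[i] is the teacher endpoint paired with the same latent draw.
    The squared-distance form and the weight alpha are assumptions.
    """
    n = len(q_values)
    # Improvement term: a higher average Q lowers the loss.
    improvement = -sum(q_values) / n
    # Correspondence term: mean squared distance to the paired teacher endpoint.
    correspondence = sum(
        sum((s - t) ** 2 for s, t in zip(student, teacher))
        for student, teacher in zip(student_actions, teacher_actions)
    ) / n
    # Both terms act on the same student output, so they can pull against each other.
    return improvement + alpha * correspondence


loss = coupled_one_step_loss(
    q_values=[1.0, 3.0],
    student_actions=[[0.0, 0.0], [1.0, 1.0]],
    teacher_actions=[[0.0, 0.0], [1.0, 2.0]],
)
```

In this sketch, any move of a student action away from its paired endpoint is penalized even when the new action has higher Q, which is the tension between the two jobs that the abstract points to.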