GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL

arXiv:2603.23232v1 Announce Type: new Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield "in-between" actions that are weakly supported by data, making decisions brittle even with a strong critic. We