Driving Intents Amplify Planning-Oriented Reinforcement Learning

arXiv:2605.12625v1 Announce Type: cross Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We