A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

arXiv:2605.08946v1 Announce Type: new Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility.
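
For readers unfamiliar with the scalarization named above, the following is a minimal sketch and not the paper's implementation. It assumes the log-sum-exp form of smooth Tchebycheff scalarization commonly used in the multi-objective optimization literature, written as a loss to be minimized over per-objective expected returns. The names `returns`, `weights`, `ideal`, and the smoothing parameter `mu` are illustrative assumptions; the abstract does not specify them.

```python
import math


def tchebycheff(returns, weights, ideal):
    """Classical (non-smooth) Tchebycheff loss: worst weighted shortfall."""
    return max(w * (z - r) for r, w, z in zip(returns, weights, ideal))


def smooth_tchebycheff(returns, weights, ideal, mu=0.1):
    """Assumed smooth Tchebycheff loss: mu * log(sum_i exp(w_i * (z_i - r_i) / mu)).

    Lower is better. `ideal` is a reference point that upper-bounds each return,
    and `mu` > 0 controls how closely the loss tracks the hard maximum.
    """
    # Weighted shortfall of each objective from the ideal point, scaled by 1/mu.
    terms = [w * (z - r) / mu for r, w, z in zip(returns, weights, ideal)]
    # Shift by the largest term so the exponentials cannot overflow.
    shift = max(terms)
    return mu * (shift + math.log(sum(math.exp(t - shift) for t in terms)))


# Hypothetical two-objective example with an equal-weight preference.
returns = [0.6, 0.2]
weights = [0.5, 0.5]
ideal = [1.0, 1.0]

hard = tchebycheff(returns, weights, ideal)
soft = smooth_tchebycheff(returns, weights, ideal, mu=0.1)
```

In this form the smooth loss lies between the hard Tchebycheff loss and that loss plus `mu * log(m)` for `m` objectives, and it strictly decreases when any single return increases, which is the sense in which it can act as a monotone utility.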