AI RESEARCH

Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control

arXiv CS.AI

arXiv:2603.10938v1 Announce Type: cross Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical.
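
To make the limitation concrete, here is a minimal sketch (not the paper's method) that compares two hypothetical cost distributions with identical means. It uses empirical conditional value-at-risk (CVaR), one well-known spectral risk measure, as a stand-in tail statistic. The function name `empirical_cvar`, the sample data, and the 5% tail level are all assumptions chosen for illustration.

```python
from statistics import fmean


def empirical_cvar(costs, tail_fraction):
    """Mean of the worst `tail_fraction` share of sampled costs (empirical CVaR)."""
    ordered = sorted(costs, reverse=True)  # highest cost first
    k = max(1, round(tail_fraction * len(ordered)))  # number of tail samples
    return fmean(ordered[:k])


# Hypothetical per-response safety costs for two policies (illustrative only).
steady_costs = [1.0] * 100           # always a small cost
risky_costs = [0.0] * 99 + [100.0]   # usually harmless, rarely catastrophic

for name, costs in [("steady", steady_costs), ("risky", risky_costs)]:
    print(name, "mean:", fmean(costs), "CVaR(5% tail):", empirical_cvar(costs, 0.05))
```

Both samples have a mean cost of 1.0, so an expected cost constraint with any budget of at least 1.0 treats them as equally safe. The 5% tail average is 1.0 for the steady sample and 20.0 for the risky one, which is the kind of difference a tail-sensitive criterion is meant to expose.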