Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

arXiv:2604.19024v1 Announce Type: new Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed-horizon reward models from human feedback and have only been validated empirically.