Beyond RLHF: A Unified Theoretical Framework of Alignment

ArXi:2506.01523v2 Announce Type: replace Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for the RLHF objective itself and do not allow comparisons of the guarantees between various methods because different methods are often analyzed under different frameworks. Toward a unified framework for alignment, we ask under what assumptions can we derive existing or new