RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment

arXiv:2502.11026v3 Announce Type: replace-cross Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF remains difficult to apply because of its implementation complexity and computational cost, particularly for online sampling-based methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization