AI RESEARCH
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
arXiv cs.LG
arXiv:2604.13602v1 Announce Type: new Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches