AI RESEARCH
Misaligned by Reward: Socially Undesirable Preferences in LLMs
arXiv CS.AI
arXiv:2605.05003v1 Announce Type: cross
Reward models are a key component of large language model alignment, serving as proxies for human preferences during