AI RESEARCH
Generalization Limits of Reinforcement Learning Alignment
arXiv cs.AI
arXiv:2604.02652v1 Announce Type: cross The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based