AI RESEARCH
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
arXiv cs.LG
arXiv:2605.16339v1
Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability: they assign contradictory preferences in response to subtle, meaning-preserving variations of the input. We analyze this instability at the representation level under three semantics-preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers.
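
As a rough illustration of what "contradictory preference assignments" could mean in practice, the sketch below counts how often a reward model's preferred response changes when a comparison pair is replaced by a meaning-preserving variant. It is not the paper's method: the names `flip_rate`, `prefers_a`, and `toy_reward` are hypothetical, and the word-count reward is a deliberately surface-sensitive stand-in for a real reward model.

```python
from typing import Callable, Sequence, Tuple

# Assumption: a reward model is any callable mapping (prompt, response) to a score.
RewardFn = Callable[[str, str], float]
# A comparison pair: (prompt, response_a, response_b).
Pair = Tuple[str, str, str]


def prefers_a(reward: RewardFn, pair: Pair) -> bool:
    """Return True if the reward model scores response_a above response_b."""
    prompt, a, b = pair
    return reward(prompt, a) > reward(prompt, b)


def flip_rate(reward: RewardFn,
              originals: Sequence[Pair],
              perturbed: Sequence[Pair]) -> float:
    """Fraction of pairs whose preferred response changes after perturbation.

    originals[i] and perturbed[i] are assumed to carry the same meaning.
    """
    if len(originals) != len(perturbed):
        raise ValueError("originals and perturbed must have the same length")
    if not originals:
        return 0.0
    flips = sum(
        prefers_a(reward, o) != prefers_a(reward, p)
        for o, p in zip(originals, perturbed)
    )
    return flips / len(originals)


def toy_reward(prompt: str, response: str) -> float:
    """Toy stand-in (not a real reward model): score a response by word count."""
    return float(len(response.split()))


originals = [
    ("Capital of France?", "Paris is the capital of France.", "I do not know."),
    ("What is 2 + 2?", "Two plus two equals four.", "Five."),
]
# Paraphrases of response_a only; the meaning of each pair is unchanged.
perturbed = [
    ("Capital of France?", "France's capital: Paris.", "I do not know."),
    ("What is 2 + 2?", "2 + 2 = 4, i.e. four.", "Five."),
]

rate = flip_rate(toy_reward, originals, perturbed)
print(rate)
```

Here the first pair flips, because the paraphrase is shorter than the competing response, and the second does not, so half of the preferences are unstable under this toy reward. A real evaluation would swap in an actual reward model and perturbations produced by paraphrasing, pattern injection, or trigger insertion.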