Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

arXiv:2605.16339v1 Announce Type: new Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability: they assign contradictory preferences in response to subtle, meaning-preserving input variations. We analyze this instability at the representation level under three meaning-preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers.
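
One simple way to make "contradictory preference assignments" concrete is a flip rate: the fraction of comparison pairs whose preferred response changes once a perturbation is applied. The sketch below is an illustrative assumption, not the paper's metric or code. The names `RewardFn`, `prefers_a`, `flip_rate`, and the length-based toy reward are all hypothetical, and a real reward model would replace `toy_reward`.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a reward model maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]
# One comparison: (prompt, response_a, response_b).
Pair = Tuple[str, str, str]


def prefers_a(reward: RewardFn, pair: Pair) -> bool:
    """Return True if the reward model scores response_a above response_b."""
    prompt, a, b = pair
    return reward(prompt, a) > reward(prompt, b)


def flip_rate(reward: RewardFn,
              originals: Sequence[Pair],
              perturbed: Sequence[Pair]) -> float:
    """Fraction of pairs whose preferred response changes after perturbation.

    originals[i] and perturbed[i] are assumed to carry the same meaning,
    e.g. perturbed[i] is a paraphrase of originals[i].
    """
    if len(originals) != len(perturbed):
        raise ValueError("originals and perturbed must have the same length")
    if not originals:
        return 0.0
    flips = sum(
        prefers_a(reward, o) != prefers_a(reward, p)
        for o, p in zip(originals, perturbed)
    )
    return flips / len(originals)


def toy_reward(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))


originals = [("q1", "a long answer", "short"), ("q2", "tiny", "a bigger reply")]
# Hypothetical rewordings; the first one reverses the length ordering.
perturbed = [("q1", "long", "shorter"), ("q2", "small", "a much bigger reply")]
rate = flip_rate(toy_reward, originals, perturbed)
print(rate)
```

A flip rate of zero on meaning-preserving perturbations would indicate stable preferences under this definition; any ties in score are treated here as "does not prefer A", which is a design choice of the sketch.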