Scaling Reward Modeling without Human Supervision

arXiv:2603.02225v2 Announce Type: replace Learning from feedback is an instrumental process for advancing the capabilities and safety of frontier models, yet its effectiveness is often constrained by cost and scalability. We present a pilot study that explores scaling reward models through unsupervised approaches. We operationalize reward-based scaling (RBS), in its simplest form, as preference learning over document prefixes and suffixes drawn from large-scale web corpora. Its advantage is demonstrated in various aspects, despite using no human annotations.
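
The abstract does not specify how preference pairs are built from prefixes and suffixes, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes each document is split at a fixed word fraction, that the true suffix is treated as the preferred continuation, that a suffix taken from a different document serves as the rejected one, and that pairs are scored with a standard Bradley-Terry pairwise loss. The names `split_document`, `build_preference_pairs`, and `bradley_terry_loss` are hypothetical.

```python
import math
import random


def split_document(text, prefix_fraction=0.5):
    """Split a document into (prefix, suffix) at a word boundary.

    The fixed split fraction is an assumption made for illustration.
    """
    words = text.split()
    cut = max(1, int(len(words) * prefix_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def build_preference_pairs(documents, seed=0):
    """Build one unsupervised preference pair per document.

    Assumption: the document's own suffix is "chosen" and a suffix
    borrowed from a different document is "rejected". Requires at
    least two documents.
    """
    rng = random.Random(seed)
    splits = [split_document(doc) for doc in documents]
    pairs = []
    for i, (prefix, true_suffix) in enumerate(splits):
        # Pick any other document to supply the mismatched suffix.
        j = rng.choice([k for k in range(len(splits)) if k != i])
        pairs.append(
            {"prefix": prefix, "chosen": true_suffix, "rejected": splits[j][1]}
        )
    return pairs


def bradley_terry_loss(score_chosen, score_rejected):
    """Pairwise loss -log(sigmoid(score_chosen - score_rejected)).

    Written in a numerically stable form for large margins.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In such a setup, a reward model would assign a scalar score to each (prefix, suffix) combination, and minimizing the loss would push the score of the true continuation above that of the mismatched one. No human labels are involved, since the preference signal comes from the documents themselves.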