AI RESEARCH
Bradley-Terry Policy Optimization for Generative Preference Modeling
arXiv CS.LG
arXiv:2510.15242v3 Announce Type: replace
Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models for tasks with verifiable answers. However, extending RL-based thought