Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training

arXiv:2604.16723v1 Announce Type: cross Large Language Models (LLMs) have demonstrated potential in automating scientific ideation, yet current approaches relying on iterative prompting or complex multi-agent architectures often suffer from hallucination or computational inefficiency. A critical bottleneck in applying Reinforcement Learning (RL) to this open-ended domain is reward hacking -- where models exploit imperfect evaluation proxies to maximize scores without producing genuine scientific innovation.