Evaluation-driven Scaling for Scientific Discovery

arXiv:2604.19341v1 Announce Type: cross Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions.