A Judge Agent Closes the Reliability Gap in AI-Generated Scientific Simulation

arXiv:2603.25780v1 Announce Type: cross Large language models can generate scientific simulation code, but the generated code silently fails on most non-textbook problems. We show that classical mathematical validation -- well-posedness, convergence, and error certification -- can be fully automated by a Judge Agent, reducing the silent-failure rate from 42% to 1.5% across 134 test cases spanning 12 scientific domains.
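
To make the convergence and error-certification checks concrete, here is a minimal sketch of one classical test: a step-halving study that estimates the observed order of accuracy and a Richardson-style error bound. It is not the paper's Judge Agent. The names `euler_decay`, `observed_order`, and `judge`, the test problem y' = -y, and the tolerance of 0.2 are all assumptions chosen for illustration.

```python
import math


def euler_decay(n_steps, t_end=1.0, rate=1.0):
    """Forward Euler for y' = -rate * y with y(0) = 1; returns y(t_end)."""
    h = t_end / n_steps
    y = 1.0
    for _ in range(n_steps):
        y += h * (-rate * y)
    return y


def observed_order(solve, n_coarse):
    """Estimate the convergence order from three runs with halved step sizes."""
    u1, u2, u3 = solve(n_coarse), solve(2 * n_coarse), solve(4 * n_coarse)
    d12, d23 = abs(u1 - u2), abs(u2 - u3)
    if d12 == 0.0 or d23 == 0.0:
        # No measurable change between refinements: the order is undefined.
        return float("nan"), u3, d23
    return math.log2(d12 / d23), u3, d23


def judge(solve, n_coarse, expected_order, tol=0.2):
    """Hypothetical check: pass only if the observed order matches theory."""
    order, u_fine, diff = observed_order(solve, n_coarse)
    # Richardson-style estimate of the error in the finest solution.
    error_estimate = diff / (2 ** expected_order - 1)
    passed = (not math.isnan(order)) and abs(order - expected_order) <= tol
    return {
        "order": order,
        "value": u_fine,
        "error_estimate": error_estimate,
        "passed": passed,
    }


# Forward Euler is first-order accurate, so the expected order is 1.
report = judge(euler_decay, n_coarse=50, expected_order=1)
print(report)
```

A solver whose observed order departs from the theoretical one would be flagged rather than silently accepted, which is the kind of failure the abstract describes. A fuller validation would also need well-posedness checks, which this sketch does not attempt.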