Evaluating agents for scientific discovery (7 minute read)

Many teams make extraordinary claims about their agents, but the evidence behind those claims is usually disappointing. ScienceWorld and DiscoveryWorld are benchmarks built to test whether AI agents can actually do science. ScienceWorld asks whether agents can recreate classic scientific discoveries at roughly an elementary school level, while DiscoveryWorld tests open-ended discovery at a college or PhD level. Both benchmarks are open and freely available, and they help measure what science agents can really do.