InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

arXiv:2604.13201v1 Announce Type: cross Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, and label noise, and they carry substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task.
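
The abstract does not describe how repositories are generated, so the following is only an illustrative sketch of the general idea of seed-based procedural generation with a verifiable answer. It is not the InfiniteScienceGym implementation. The names `generate_task` and `check`, the file names, and the question itself are all hypothetical. The sketch shows why such a benchmark needs little storage (a task can be regenerated from its seed) and why answers are checkable (the ground truth is computed from the generated data rather than annotated by hand).

```python
import random
import statistics


def generate_task(seed: int, n_rows: int = 50) -> dict:
    """Hypothetical generator: build a tiny synthetic 'repository' and a
    question whose answer is computed directly from the generated data."""
    rng = random.Random(seed)  # the same seed always yields the same task
    true_effect = rng.uniform(-2.0, 2.0)  # hidden parameter of the simulated study
    control = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_rows)]
    # Ground truth comes from the data itself, not from a human label.
    answer = statistics.mean(treated) > statistics.mean(control)
    return {
        "files": {"control.csv": control, "treated.csv": treated},
        "question": "Is the mean of treated.csv greater than the mean of control.csv?",
        "answer": answer,
    }


def check(task: dict, predicted: bool) -> bool:
    """Verify a model's yes/no prediction against the generated ground truth."""
    return predicted == task["answer"]
```

Under these assumptions, only the seed needs to be stored: calling `generate_task(7)` twice returns identical files, question, and answer, and `check` scores a prediction without any human annotation.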