Toward Evaluation Frameworks for Multi-Agent Scientific AI Systems

arXiv:2603.26718v1 Announce Type: cross

We analyze the challenges of benchmarking scientific (multi)-agentic systems, including the difficulty of distinguishing reasoning from retrieval, the risks of data/model contamination, the lack of reliable ground truth for novel research problems, the complications