An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc

arXiv:2603.15976v1 Announce Type: new While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach insufficient for library code in HPC where solver selection, API conventions, memory management, and performance are just as critical as functional correctness. To address this gap, we