AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation

arXiv:2509.16952v2 Announce Type: replace-cross The growing volume of academic papers has made it increasingly difficult for researchers to extract key information efficiently. While large language model (LLM)-based agents can automate question answering (QA) workflows for scientific papers, a comprehensive and realistic benchmark for evaluating their capabilities is still lacking. Moreover