AI RESEARCH

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

arXiv CS.AI

arXiv:2509.24186v2 Announce Type: replace-cross

Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we