Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

arXiv:2605.00238v1 Announce Type: new Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We