When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

arXiv:2603.29559v1 Announce Type: new Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation, in which high-confidence predictions are processed automatically while uncertain cases are flagged for human review.
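
The selective-automation idea can be illustrated with a minimal Python sketch. It assumes each LLM grade comes with a confidence score in [0, 1]; the `GradedItem` type, the `route_by_confidence` function, and the 0.9 threshold are hypothetical names and values chosen for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GradedItem:
    """One LLM-graded response (hypothetical structure for illustration)."""
    item_id: str
    predicted_grade: str
    confidence: float  # assumed to lie in [0, 1]


def route_by_confidence(
    items: List[GradedItem], threshold: float = 0.9
) -> Tuple[List[GradedItem], List[GradedItem]]:
    """Split items into (auto_accepted, flagged_for_review).

    Items whose confidence meets the threshold are accepted automatically;
    all others are flagged for human review. The default threshold is an
    arbitrary illustrative value.
    """
    auto_accepted: List[GradedItem] = []
    flagged_for_review: List[GradedItem] = []
    for item in items:
        if item.confidence >= threshold:
            auto_accepted.append(item)
        else:
            flagged_for_review.append(item)
    return auto_accepted, flagged_for_review


if __name__ == "__main__":
    batch = [
        GradedItem("q1", "correct", 0.97),
        GradedItem("q2", "incorrect", 0.62),
        GradedItem("q3", "correct", 0.90),
    ]
    auto, review = route_by_confidence(batch)
    print("auto:", [i.item_id for i in auto])
    print("review:", [i.item_id for i in review])
```

In practice the threshold would trade off how much grading is automated against how many errors slip through, so it would typically be tuned on held-out data with human grades.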