CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

arXiv:2603.11957v1 Announce Type: new Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We