The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost

arXiv:2604.26954v1 Announce Type: cross Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google.
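
To make the term concrete, self-consistency as intra-model majority voting can be sketched as follows. This is a minimal illustration, not the study's implementation: `score_once` is a hypothetical stand-in for a single LLM scoring call, the sample count of 5 is arbitrary, and breaking ties toward the lowest score is an assumption made here for determinism.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(scores: Sequence[int]) -> int:
    """Return the most frequent score; ties go to the lowest score (assumption)."""
    counts = Counter(scores)
    top = max(counts.values())
    return min(s for s, c in counts.items() if c == top)


def self_consistent_score(
    score_once: Callable[[str], int],  # hypothetical: one model call, one score
    conversation: str,
    n_samples: int = 5,  # arbitrary illustrative value
) -> int:
    """Score the same conversation n_samples times with one model, then vote."""
    samples = [score_once(conversation) for _ in range(n_samples)]
    return majority_vote(samples)


def accuracy(predicted: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of conversations where the model score matches the human score."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)
```

Because each vote requires `n_samples` model calls per conversation, cost grows roughly linearly with the sample count, which is the trade-off against accuracy that the abstract weighs.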