ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

arXiv:2510.09351v2 Announce Type: replace While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we present ReTraceQA, a novel benchmark that