Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

arXiv:2605.19228v1 Announce Type: cross

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we