Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces

arXiv:2604.11996v1 Announce Type: cross Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can, nevertheless, exhibit similar benchmark accuracy, for example due to memorization or over-optimization.