When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

arXiv:2605.00817v1 Announce Type: new Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value.
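To make the task format concrete, the following is a minimal sketch of how one such item could be represented and scored. The step format, the three example steps, and the names `run_procedure`, `EXAMPLE_STEPS`, and `is_correct` are assumptions made for illustration; they are not taken from the paper, whose actual algorithms and scoring may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical step format: a description shown to the model, plus a
# function mapping (a, b, acc) to the new accumulator value.
Step = Tuple[str, Callable[[int, int, int], int]]


def run_procedure(steps: List[Step], a: int, b: int) -> Tuple[int, List[int]]:
    """Execute the steps in order; return the final value and every intermediate value."""
    acc = 0
    trace: List[int] = []
    for _description, fn in steps:
        acc = fn(a, b, acc)
        trace.append(acc)
    return acc, trace


# Illustrative three-step algorithm (an assumption, not a benchmark item).
EXAMPLE_STEPS: List[Step] = [
    ("Set acc to a + b", lambda a, b, acc: a + b),
    ("Multiply acc by 2", lambda a, b, acc: acc * 2),
    ("Subtract b from acc", lambda a, b, acc: acc - b),
]


def is_correct(model_answer: int, steps: List[Step], a: int, b: int) -> bool:
    """Score a model's final answer against the reference execution."""
    expected, _trace = run_procedure(steps, a, b)
    return model_answer == expected


if __name__ == "__main__":
    final, trace = run_procedure(EXAMPLE_STEPS, 3, 4)
    print(final, trace)
```

Keeping the intermediate trace alongside the final value is a design choice of this sketch: a wrong answer can then be compared against each intermediate value, for example to see whether it matches the result of stopping one step early.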