Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

arXiv:2605.00677v1

While Large Language Models have achieved notable success on formal mathematics benchmarks such as MiniF2F, it remains unclear whether these results stem from genuine logical reasoning or semantic pattern matching against pre-