Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning

arXiv:2604.18897v1 Announce Type: cross We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories. Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60--79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline.