Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs

arXiv:2603.26828v1 Announce Type: cross Arithmetic benchmarks are often reduced to a single held-out score, but that score can conflate qualitatively different failures. We study a controlled minimal GPT trained on exhaustive 2-digit addition, where all local digit transitions are already present in