s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

arXiv:2603.14628v1 Announce Type: cross Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light.