Benchmarking Testing in Automated Theorem Proving

arXiv:2604.23698v1 Announce Type: new Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely either on indirect proxies, such as lexical overlap with human-annotated proofs, or on expensive manual inspection.