From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

arXiv:2601.03731v3. As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, meaning the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fall at one of two extremes: isolated code snippets or black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification.