AI RESEARCH
Why we no longer evaluate SWE-bench Verified
OpenAI Blog
SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and