We made a coding benchmark that's actually hard to fake. Best result across GPT-5.2, o4-mini, Gemini, Qwen, Kimi with every prompting trick we could think of: 11%.

r/LocalLLaMA

The idea came from noticing how hard it is to tell what's actually going on when a model "solves" a coding problem. Is it reasoning through the problem, or is it pattern-matching against the enormous amount of Python and JavaScript it saw during