We made a coding benchmark that's actually hard to fake. Best result across GPT-5.2, O4-mini, Gemini, Qwen, Kimi with every prompting trick we could think of: 11%.
r/LocalLLaMA
•
Generative AI
Open Source AI
AI Research
The idea came from noticing how hard it is to tell what's actually going on when a model "solves" a coding problem. Is it reasoning through the problem or is it pattern matching against the enormous amount of Python and JavaScript it saw during