I compared 8 AI coding models on the same real-world feature in an open-source TypeScript project. Here are the results

When you use AI tools for coding, the question "which model is actually better?" comes up constantly. Synthetic benchmarks often don't reflect reality, because models can be trained specifically to pass them. There's also a significant difference between solving isolated problems and working in a real codebase, where a model has to understand the requirements, navigate the project's architecture, integrate new functionality correctly, and not break anything.