My experience with testing all frontier open-weight models against GPT and Claude

I spent about a week testing open-weight models on real work, comparing them against what I already know from ChatGPT, Gemini, and Claude. The gap between what benchmarks suggest and what happens when you give these models something to verify is bigger than I expected. The clearest example: I ran an audit of a 66-skill codebase for description quality, routing conflicts, and overlap. Ten models, same files, same OpenCode setup with identical tools and MCPs. Everything except ChatGPT ran through an Ollama Cloud subscription. The answers were in the repo, so I could ground-truth every claim.