Qwen 3 8B topped 6 of 13 hard evals against models 4x its size, blind peer eval of 10 SLMs

I ran 13 blind peer evaluations today testing 10 small language models on hard frontier-level questions. Not summarization or trivia. Distributed lock debugging, Go concurrency bugs, SQL optimization, Bayesian medical diagnosis, Simpson's Paradox, Arrow's voting theorem, and survivorship bias analysis. The same difficulty level I use for GPT-5.4 and Claude Opus 4.6. The results surprised me. I ran the numbers twice because the 8B model kept winning.