Fine-tuned Qwen3 SLMs (0.6-8B) beat frontier LLMs on narrow tasks

We spent a while putting together a systematic comparison of small distilled Qwen3 models (0.6B to 8B) against frontier APIs - GPT-5 nano/mini/5.2, Gemini 2.5 Flash Lite/Flash, Claude Haiku 4.5/Sonnet 4.6/Opus 4.6, Grok 4.1 Fast/Grok 4 - across 9 datasets spanning classification, function calling, QA, and open-book QA. All distilled models were trained using open-weight teachers only (no frontier API outputs in the