Qwen3.5-9B on document benchmarks: where it beats frontier models and where it doesn't.

We run an open document AI benchmark: 20 models, 9,000+ real documents. We just added all four Qwen3.5 sizes (0.8B to 9B), so we now have per-task breakdowns for every model. You can see the results here: idp-leaderboard.org

Where Qwen wins or matches:

OlmOCR (text extraction from messy scans, dense PDFs, multi-column layouts):

Qwen3.5-9B: 78.1
Qwen3.5-4B: 77.2
Gemini 3.1 Pro: 74.6
Claude Sonnet 4.6: 74.4
Qwen3.5-2B: 73.7
GPT-5.4: 73.4

The 9B and 4B are ahead of every frontier model on raw text extraction. The 2B roughly matches GPT-5.4 (73.7 vs 73.4).
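
If you want to sanity-check the OlmOCR ranking yourself, here is a minimal Python sketch. The scores are copied from this post; the dictionary, the `is_qwen` helper, and the variable names are just for illustration and are not part of the leaderboard's tooling.

```python
# OlmOCR scores as quoted in this post (higher is better).
olmocr_scores = {
    "Qwen3.5-9B": 78.1,
    "Qwen3.5-4B": 77.2,
    "Gemini 3.1 Pro": 74.6,
    "Claude Sonnet 4.6": 74.4,
    "Qwen3.5-2B": 73.7,
    "GPT-5.4": 73.4,
}


def is_qwen(name: str) -> bool:
    """Hypothetical helper: treat any model whose name starts with 'Qwen' as Qwen."""
    return name.startswith("Qwen")


# Best-scoring non-Qwen (frontier) model on this task.
best_frontier_name, best_frontier_score = max(
    ((name, score) for name, score in olmocr_scores.items() if not is_qwen(name)),
    key=lambda pair: pair[1],
)

# Margin of each Qwen size over that best frontier score, rounded to one decimal.
margins = {
    name: round(score - best_frontier_score, 1)
    for name, score in olmocr_scores.items()
    if is_qwen(name)
}

# Full ranking, highest score first.
ranking = sorted(olmocr_scores, key=olmocr_scores.get, reverse=True)

if __name__ == "__main__":
    print("Best frontier model:", best_frontier_name, best_frontier_score)
    for name in ranking:
        print(f"{name}: {olmocr_scores[name]}")
    for name, margin in margins.items():
        print(f"{name} vs {best_frontier_name}: {margin:+.1f}")
```

With these numbers, the 9B is 3.5 points ahead of the best frontier score (Gemini 3.1 Pro), the 4B is 2.6 ahead, and the 2B is 0.9 behind it while still edging out GPT-5.4.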