Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub

r/LocalLLaMA
AI Hardware Open Source AI AI Tools

Pushed Qwen 3.5 27B (the dense one, not MoE) to 1,103,941 tok/s on 12 nodes with 96 B200 GPUs using vLLM. 9,500 to 95K per node came from four changes: DP=8 over TP=8, context window from 131K to 4K, FP8 KV cache, and MTP-1 speculative decoding. That last one was the biggest -- without MTP, GPU utilization was 0%. Scaling: 97.1% efficiency at 8 nodes, 96.5% at 12. ClusterIP round-robin. The Inference Gateway with KV-cache-aware routing added 35% overhead, so we didn't use it. No custom kernels, vLLM v0.18.0 out of the box. GDN kernel optimizations still coming upstream.