1 million tokens per second from a single cluster: what that actually means

Got Qwen 3.5 27B to 1,103,941 tok/s on 12 nodes with 96 B200 GPUs. At that rate you can process 50,000 insurance policy documents in hours instead of weeks, or serve 16K concurrent users with sub-50ms per-token latency. This is a 27B open-weight model, not a frontier one. No custom kernels, just vLLM v0.18.0 out of the box. GDN kernel optimizations and disaggregated prefill/decode are still coming, so today's numbers are the floor.

Disclosure: I work for Google Cloud.

submitted by /u/m4r1k_
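For anyone who wants to sanity-check the headline numbers, here is a rough back-of-envelope sketch in Python. Only the throughput, node count, GPU count, and document count come from the post. Everything else is an assumption for illustration: "16K" is read as 16,000 streams, the aggregate rate is treated as if it were shared evenly across streams (the post does not say how the figure splits between prefill and decode), and `TOKENS_PER_DOC` is a made-up placeholder since the post gives no document size.

```python
# Figures quoted in the post.
TOTAL_TOK_PER_S = 1_103_941
NODES = 12
GPUS = 96
DOCS = 50_000

# Assumptions for illustration only (NOT from the post).
CONCURRENT_STREAMS = 16_000   # "16K" read as 16,000
TOKENS_PER_DOC = 100_000      # hypothetical tokens processed per document


def per_gpu_throughput(total=TOTAL_TOK_PER_S, gpus=GPUS):
    """Average tokens per second contributed by one GPU."""
    return total / gpus


def per_node_throughput(total=TOTAL_TOK_PER_S, nodes=NODES):
    """Average tokens per second contributed by one node."""
    return total / nodes


def ms_per_token_per_stream(total=TOTAL_TOK_PER_S, streams=CONCURRENT_STREAMS):
    """Average time between tokens for one stream, assuming the aggregate
    rate is split evenly across all concurrent streams."""
    return 1000.0 * streams / total


def batch_hours(docs=DOCS, tokens_per_doc=TOKENS_PER_DOC, total=TOTAL_TOK_PER_S):
    """Wall-clock hours to push docs * tokens_per_doc tokens through the cluster."""
    return docs * tokens_per_doc / total / 3600.0


if __name__ == "__main__":
    print(f"per GPU:  {per_gpu_throughput():,.0f} tok/s")
    print(f"per node: {per_node_throughput():,.0f} tok/s")
    print(f"per stream: {ms_per_token_per_stream():.1f} ms/token")
    print(f"batch: {batch_hours():.2f} h")
```

Under those assumptions it works out to roughly 11,500 tok/s per GPU and about 14.5 ms per token per stream, which is consistent with the sub-50ms claim. The batch estimate scales linearly with whatever you plug in for `TOKENS_PER_DOC`, so swap in your own document size before reading anything into it.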