KV FP8 with Gemma4 26B

Dev.to AI

✦ The vLLM service is now online and healthy! 🟢

Final Status:
vLLM Health: 🟢 200 OK
Active Endpoint:
Model: google/gemma-4-26B-A4B-it
Optimizations: KV FP8 enabled, bfloat16, speculative decoding (ngram)

Key Observations

High Prefill Throughput: The TPU v6e cluster scaled efficiently under load. At maximum concurrency (1024 users) with a 16,384-token context length, it reached a prefill rate of 475,552 tokens per second (tok/s).

TTFT Scaling: Time to first token (TTFT) increased gradually with concurrency, as expected.
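As a rough sanity check on the prefill figure above, the aggregate rate can be broken down per user. This is only a back-of-the-envelope sketch: it assumes every one of the 1024 users submits a full 16,384-token prompt and that the aggregate rate is shared evenly, neither of which the benchmark output states. The function name `prefill_breakdown` is hypothetical and used here purely for illustration.

```python
def prefill_breakdown(aggregate_tok_per_s, concurrent_users, context_len):
    """Split an aggregate prefill rate into per-user figures.

    Assumption (not from the benchmark): all users send full-length
    prompts and share the aggregate prefill rate evenly.
    """
    # Average prefill rate seen by a single user.
    per_user_rate = aggregate_tok_per_s / concurrent_users
    # Total prompt tokens if every user fills the whole context.
    total_prompt_tokens = concurrent_users * context_len
    # Time to prefill all prompts at the aggregate rate.
    full_prefill_seconds = total_prompt_tokens / aggregate_tok_per_s
    return per_user_rate, total_prompt_tokens, full_prefill_seconds


# Figures reported in the post: 475,552 tok/s, 1024 users, 16,384 context.
per_user_rate, total_prompt_tokens, full_prefill_seconds = prefill_breakdown(
    475_552, 1024, 16_384
)

print(f"per-user prefill rate: {per_user_rate:.1f} tok/s")
print(f"total prompt tokens:   {total_prompt_tokens:,}")
print(f"time to prefill all:   {full_prefill_seconds:.1f} s")
```

Under these assumptions each user sees roughly 464 tok/s of prefill, and prefilling all 1024 full-length prompts would take about 35 seconds, which is consistent with TTFT growing as concurrency rises.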