Gemma4 Speculative Decoding with n-gram

Dev.to AI

Using the MCP Toolset for benchmarking, the 26B MoE Gemma4 model was updated to use n-gram speculative decoding. The latest Gemma4 assistant models, which provide full speculative decoding, are not yet supported by vLLM serving on TPU, so n-gram drafting was used for speculative decoding instead.

Hardware: each TPU v6e (Trillium) chip has 32GB of HBM, so the v6e-4 slice used here has 128GB of HBM in total.
Model weights: in bfloat16 (2 bytes per parameter), the 26B model takes approximately 52GB.
Headroom: this leaves ~76GB for the KV cache and activation buffers.
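Two things in the paragraph above can be made concrete with a short Python sketch: the memory arithmetic, and what an n-gram drafter does in principle. This is an illustration only, not vLLM's implementation. The function names (hbm_budget, propose_ngram_draft) and parameters (max_ngram, min_ngram, num_draft) are hypothetical, and the sketch assumes decimal gigabytes and ignores any memory overhead beyond the raw weights.

```python
from typing import List, Sequence, Tuple


def hbm_budget(num_chips: int, hbm_per_chip_gb: float,
               params_billion: float, bytes_per_param: int) -> Tuple[float, float, float]:
    """Return (total HBM, weight size, headroom) in GB.

    Assumption: 1 billion parameters at 1 byte each is counted as 1 GB,
    which matches the round figures quoted in the text.
    """
    total_gb = num_chips * hbm_per_chip_gb
    weights_gb = params_billion * bytes_per_param
    return total_gb, weights_gb, total_gb - weights_gb


def propose_ngram_draft(tokens: Sequence[int], max_ngram: int = 3,
                        min_ngram: int = 1, num_draft: int = 4) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context.

    Tries the longest n-gram first. On a match, the tokens that followed the
    earlier occurrence become the draft, which the target model then verifies.
    Returns an empty list when nothing matches (no speculation this step).
    """
    n_tokens = len(tokens)
    for n in range(max_ngram, min_ngram - 1, -1):
        if n_tokens <= n:
            continue  # context too short to contain an earlier occurrence
        suffix = list(tokens[-n:])
        # Scan backwards so the most recent earlier occurrence wins;
        # the starting index excludes the suffix matching itself.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + num_draft])
    return []


if __name__ == "__main__":
    total, weights, headroom = hbm_budget(4, 32, 26, 2)
    print(f"total={total}GB weights={weights}GB headroom={headroom}GB")
    # Context ends in (1, 2, 3), which also appears at the start of the sequence.
    print(propose_ngram_draft([1, 2, 3, 4, 5, 1, 2, 3]))
```

Because the draft comes from a lookup in the existing context rather than from a second model, this style of drafting needs no extra weights in HBM, which is why it remains an option when a dedicated draft model cannot be served. It tends to help most when the output repeats spans of the prompt, and little otherwise.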