Got Gemma 4 running locally on CUDA, both float and GGUF quantized, with benchmarks

Spent the last week getting Gemma 4 working on CUDA with both full-precision (BF16) and GGUF quantized inference. Here's a video of it running. Sharing some findings because this model has some quirks that aren't obvious.

Performance (Gemma 4 E2B, RTX 3090):

| Config | BF16 Float | Q4_K_M GGUF |
|---|---|---|
| short gen (p=1, g=32) | 110 tok/s | 170 tok/s |
| long gen (p=512, g=128) | 72 tok/s | 93 tok/s |

The precision trap nobody warns you about

Honestly, making it work was harder than I thought. Gemma 4 uses attention_scale=1.0 (QK-norm instead of the usual 1/sqrt(d_k) scaling).
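To make the scaling difference concrete, here is a rough pure-Python sketch, not my actual CUDA kernels. It compares classic attention (raw q·k multiplied by 1/sqrt(d_k)) with the QK-norm variant (q and k normalized first, then scale=1.0). The helper names (`rms_norm`, `attention_weights`) and the toy vectors are made up for illustration, and I'm assuming a plain RMS norm with no learned weight, which the real model may not match exactly.

```python
import math

def rms_norm(v, eps=1e-6):
    """Rescale v to unit root-mean-square (assumption: no learned weight)."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys, scale, qk_norm):
    """Attention weights of one query over a list of keys."""
    if qk_norm:
        q = rms_norm(q)
        keys = [rms_norm(k) for k in keys]
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(logits)

d_k = 4
q = [1.0, 2.0, 3.0, 4.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
    [4.0, 3.0, 2.0, 1.0],
]

# Classic scaled dot-product attention: raw vectors, scale = 1/sqrt(d_k).
classic = attention_weights(q, keys, scale=1.0 / math.sqrt(d_k), qk_norm=False)

# QK-norm variant as described above: normalized vectors, scale = 1.0.
qk_normed = attention_weights(q, keys, scale=1.0, qk_norm=True)

# With QK-norm, the magnitude of q no longer changes the result.
q_big = [10.0 * x for x in q]
qk_normed_big = attention_weights(q_big, keys, scale=1.0, qk_norm=True)

print("classic:  ", classic)
print("qk-normed:", qk_normed)
```

The practical takeaway: if a kernel written for other models still multiplies by 1/sqrt(d_k) on top of the normalized q and k, the logits shrink by a factor of sqrt(d_k) and the attention distribution flattens, so the scale has to be a per-model parameter rather than something hardcoded.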