Gemma-4-31B NVFP4 inference numbers on 1x RTX Pro 6000

r/LocalLLaMA

Ran a quick inference sweep on Gemma 4 31B in NVFP4 (using nvidia/Gemma-4-31B-IT-NVFP4). The NVFP4 checkpoint is 32GB, about half the size of Google's BF16 checkpoint (63GB). That's likely because it's a mix of BF16 and FP4 weights, which lands it at roughly FP8 size. This model uses a ton of VRAM for KV cache, so I dropped the KV cache precision to FP8. Served with vLLM, load generated with Locust, 1K output tokens per request. All numbers are steady-state averages under sustained load, and the numbers below are per-user metrics to show user interactivity.
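For anyone who wants to sanity-check the size reasoning, here's a rough back-of-the-envelope sketch. The first function just divides checkpoint size by parameter count, using the 32GB / 63GB / 31B figures from the post and treating 1GB as 1e9 bytes. The second estimates key/value cache bytes per token for a plain transformer. The layer count, KV head count, and head dimension in the example are made-up placeholders, not the real Gemma config, and the formula assumes every layer caches full keys and values.

```python
def bits_per_param(checkpoint_gb: float, params_billion: float) -> float:
    """Average stored bits per parameter (1 GB taken as 1e9 bytes)."""
    return checkpoint_gb * 8 / params_billion


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """Key/value cache bytes for one token of context.

    The factor of 2 covers keys plus values. Assumes every layer
    stores a full K and V entry for every token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Checkpoint sizes from the post, 31B parameters.
print(round(bits_per_param(63, 31), 2))  # BF16 checkpoint, about 16 bits
print(round(bits_per_param(32, 31), 2))  # NVFP4 checkpoint, about 8 bits

# HYPOTHETICAL dimensions for illustration only (not Gemma's config).
layers, kv_heads, head_dim = 60, 16, 128
bf16_cache = kv_bytes_per_token(layers, kv_heads, head_dim, 2)
fp8_cache = kv_bytes_per_token(layers, kv_heads, head_dim, 1)
print(bf16_cache, fp8_cache)  # FP8 halves the per-token cache footprint
```

An average of about 8 bits per parameter is consistent with the guess that the checkpoint mixes 16-bit and 4-bit tensors, and whatever the real dimensions are, moving the cache from 2 bytes to 1 byte per element halves its footprint.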