Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark — here's how
r/LocalLLaMA
I spent half the night getting google/gemma-4-26B-A4B-it running fast on a single NVIDIA DGX Spark (128GB unified memory, GB10 Blackwell). Here are some things I learned that might save others time.

NVFP4 quantization

The 26B MoE model is ~49GB in BF16. It runs, but slowly. NVFP4 brings it down to 16.5GB, roughly 3x compression. The catch: Google stores the MoE expert weights as fused 3D tensors that no existing quantization tool handles. NVIDIA's modelopt silently skips them (91% of the model!). I wrote a custom plugin that unfuses the experts into individual layers, quantizes them, then re-exports.
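To make the unfuse, quantize, re-export idea concrete, here is a rough NumPy sketch. It is not the plugin itself and does not call modelopt. The tensor layout `(num_experts, out_features, in_features)`, the function names, and the toy block quantizer are all assumptions for illustration; real NVFP4 uses a 4-bit float format with its own scaling scheme, which this does not reproduce.

```python
import numpy as np


def unfuse_experts(fused):
    """Split a fused 3D expert tensor into a list of per-expert 2D matrices.

    Assumed layout (hypothetical): (num_experts, out_features, in_features).
    """
    if fused.ndim != 3:
        raise ValueError("expected a 3D fused expert tensor")
    return [fused[e].copy() for e in range(fused.shape[0])]


def refuse_experts(matrices):
    """Stack per-expert 2D matrices back into one fused 3D tensor."""
    return np.stack(matrices, axis=0)


def toy_block_quantize(w, block=16, levels=7):
    """Toy stand-in for a 4-bit quantizer (NOT real NVFP4).

    Each run of `block` values gets its own scale and is rounded to an
    integer grid in [-levels, levels], then dequantized back to floats.
    """
    if w.size % block != 0:
        raise ValueError("weight size must be divisible by the block size")
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / levels
    scale[scale == 0] = 1.0  # avoid dividing by zero on all-zero blocks
    q = np.round(flat / scale)
    return (q * scale).reshape(w.shape)


def quantize_fused_experts(fused, quantize_fn=toy_block_quantize):
    """Unfuse, quantize every expert as an ordinary 2D weight, then re-fuse."""
    experts = unfuse_experts(fused)
    return refuse_experts([quantize_fn(w) for w in experts])


# Small made-up tensor: 4 experts, each an 8x32 weight matrix.
rng = np.random.default_rng(0)
fused = rng.standard_normal((4, 8, 32))
quantized = quantize_fused_experts(fused)
print(quantized.shape, float(np.abs(quantized - fused).max()))
```

The point of the split is that each expert becomes a plain 2D weight, which is the shape per-layer quantizers generally expect, so nothing gets skipped for having an extra leading dimension.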