Workaround for NVFP4 MoE on RTX 5090 / PRO 6000 (SM120): --moe-backend marlin gives 47 tok/s on Qwen 3.5 397B
r/LocalLLaMA
# Workaround: NVFP4 MoE Models on SM120 (RTX PRO 6000 / RTX 5090 Blackwell) -- Use `--moe-backend marlin`

## TL;DR

NVFP4 MoE models produce **garbage output** on SM120 GPUs due to broken CUTLASS `mm_fp4` GEMM kernels. The fix: use vLLM's Marlin MoE backend (`--moe-backend marlin`) with `--tensor-parallel-size 2 --pipeline-parallel-size 2`. This bypasses the broken kernel entirely by dequantizing FP4 weights to FP16 before computation (W4A16 path), preserving memory bandwidth savings while producing correct output.
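To make the W4A16 idea concrete, here is a minimal pure-Python sketch of what "dequantize FP4 weights, then compute in higher precision" means. It is not Marlin's actual kernel, which as far as I understand does this on the fly on the GPU. The function names, the low-nibble-first packing order, and the way the scale is passed in are assumptions for illustration only. The E2M1 value table is the standard 4-bit float layout (1 sign, 2 exponent, 1 mantissa bit), and NVFP4 attaches one scale to each small block of weights.

```python
# Illustrative W4A16 sketch: weights stay packed as 4-bit codes in memory
# and are expanded to ordinary floats only right before the multiply.
# All names here are hypothetical; this is not vLLM or Marlin code.

# Magnitudes representable by E2M1 (index = low 3 bits of the 4-bit code).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code; bit 3 is the sign bit."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_block(packed: bytes, block_scale: float) -> list[float]:
    """Expand packed FP4 codes (two per byte) and apply the block scale.

    Assumption: the low nibble holds the first element of each pair.
    """
    out = []
    for byte in packed:
        lo, hi = byte & 0xF, (byte >> 4) & 0xF
        out.append(decode_e2m1(lo) * block_scale)
        out.append(decode_e2m1(hi) * block_scale)
    return out


def w4a16_dot(packed: bytes, block_scale: float, activations: list[float]) -> float:
    """Dot product with 4-bit weights and full-precision activations."""
    weights = dequantize_block(packed, block_scale)
    return sum(w * a for w, a in zip(weights, activations))
```

The point of the sketch: only the packed 4-bit codes and the per-block scale are read from memory, so the bandwidth saving is kept, while the arithmetic itself never touches an FP4 GEMM path.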