55 → 282 tok/s: How I got Qwen3.5-397B running at speed on 4x RTX PRO 6000 Blackwell

r/LocalLLaMA

TL;DR: Built a custom CUTLASS kernel to fix SM120's broken MoE GEMM tiles. Went from 55 tok/s (WSL2) → 119 (native Linux) → 142 (driver/config optimization) → 282 tok/s (custom K=64 kernel). PR submitted to FlashInfer, pre-built Docker image available.

The Problem

If you're running NVFP4 MoE models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000, RTX 5090, or DGX Spark - basically any SM120 Blackwell workstation GPU - you've probably seen this:

Failed to initialize cutlass TMA WS grouped gemm

The autotuner skips all the SM120 GEMM tiles because they overflow your GPU's 99KB shared memory.
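To make the shared-memory overflow concrete, here is a rough back-of-the-envelope sketch in Python. It is not how CUTLASS or FlashInfer actually account for shared memory: the tile shapes, the stage count, the bf16 epilogue buffer, and the one-byte-scale-per-16-elements layout are all assumptions picked for illustration, and real kernels also spend shared memory on barriers and alignment padding. It also assumes the "K=64" in the TL;DR refers to the K extent of the GEMM tile. The only number taken from the post is the 99KB budget.

```python
# Rough shared-memory estimate for a pipelined FP4 GEMM tile.
# All shapes and layout choices below are illustrative assumptions,
# not values read out of CUTLASS or FlashInfer.

SMEM_BUDGET_BYTES = 99 * 1024  # the 99KB figure quoted in the post


def stage_bytes(tile_m, tile_n, tile_k, scale_block=16):
    """Bytes for one pipeline stage: A and B operand tiles plus block scales."""
    elems = tile_m * tile_k + tile_n * tile_k
    data = elems // 2              # 4-bit elements, two per byte
    scales = elems // scale_block  # assumed: one 1-byte scale per 16 elements
    return data + scales


def tile_smem_bytes(tile_m, tile_n, tile_k, stages, out_bytes=2):
    """Total estimate: all mainloop stages plus an assumed bf16 epilogue buffer."""
    epilogue = tile_m * tile_n * out_bytes
    return stages * stage_bytes(tile_m, tile_n, tile_k) + epilogue


def fits(tile_m, tile_n, tile_k, stages):
    """True if the estimated footprint stays within the shared-memory budget."""
    return tile_smem_bytes(tile_m, tile_n, tile_k, stages) <= SMEM_BUDGET_BYTES


if __name__ == "__main__":
    # Hypothetical tile shapes with a hypothetical 4-stage pipeline.
    for shape in [(128, 128, 128), (128, 128, 64)]:
        total = tile_smem_bytes(*shape, stages=4)
        verdict = "fits" if fits(*shape, stages=4) else "overflows"
        print(f"tile {shape}: {total} bytes, {verdict} {SMEM_BUDGET_BYTES} bytes")
```

Under these assumptions, halving the tile's K extent halves the per-stage operand footprint while leaving the epilogue buffer unchanged, which is the kind of trade-off that can bring a tile back under the budget.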