Faster MoE Inference with Warp Decode (16 minute read)

TLDR AI
AI Research

Cursor's “warp decode” is a kernel design that reorganizes mixture-of-experts (MoE) inference around output neurons instead of around experts. On Blackwell GPUs, it achieves roughly 1.8x higher throughput along with improved numerical accuracy.
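The summary does not describe the kernel itself, so the following is only a minimal sketch of the general idea of "organizing around output neurons instead of experts." It makes several assumptions. Each expert is treated as a single linear layer (real MoE experts usually have several projections). A token is routed to a few experts with gate weights. The outer loop over output index `j` stands in for the GPU work units (for example, one warp per output neuron). The function names `expert_centric` and `neuron_centric` are hypothetical and are not Cursor's API. The sketch only shows that both loop orders compute the same result. It says nothing about the reported speed or accuracy gains.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # row-major: one row per output neuron
Vector = List[float]
Routes = List[Tuple[int, float]]  # (expert index, gate weight) for one token


def expert_centric(x: Vector, experts: List[Matrix], routes: Routes) -> Vector:
    """Conventional ordering: run each selected expert in full, then combine."""
    out_dim = len(experts[routes[0][0]])
    out = [0.0] * out_dim
    for e, gate in routes:
        # Full matrix-vector product for expert e (an intermediate output vector).
        y = [sum(w * xi for w, xi in zip(row, x)) for row in experts[e]]
        for j in range(out_dim):
            out[j] += gate * y[j]
    return out


def neuron_centric(x: Vector, experts: List[Matrix], routes: Routes) -> Vector:
    """Output-neuron ordering (hypothetical sketch): each output index j loops
    over the token's selected experts and keeps a single local accumulator."""
    out_dim = len(experts[routes[0][0]])
    out = []
    for j in range(out_dim):  # assumption: on a GPU, each j could map to one warp
        acc = 0.0
        for e, gate in routes:
            acc += gate * sum(w * xi for w, xi in zip(experts[e][j], x))
        out.append(acc)
    return out


# Toy example: 3 experts, each mapping 3 inputs to 2 outputs; token routed to experts 0 and 2.
experts = [
    [[1, 0, 2], [0, 1, 1]],
    [[2, 1, 0], [1, 1, 1]],
    [[0, 0, 1], [3, 0, 0]],
]
x = [1.0, 2.0, 3.0]
routes = [(0, 0.5), (2, 0.25)]

print(expert_centric(x, experts, routes))  # [4.25, 3.25]
print(neuron_centric(x, experts, routes))  # [4.25, 3.25]
```

The two functions differ only in loop order. The expert-centric version materializes a full output vector per expert and then scatters it into the result. The neuron-centric version never builds per-expert intermediates, so each output value is reduced in one place.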