Slow tok/s when offloading NVFP4 model to CPU
r/LocalLLaMA
AI Hardware
Title. I was messing around with Qwen3.6 35B A3B Q4_K_XL on my RTX 5070 and got around 50 tok/s. I then realized I could be leveraging NVFP4 on my Blackwell GPU, but when I tried it, it barely reached 14 tok/s. The model doesn't fit in VRAM, so I had to offload some layers to the CPU. I am guessing NVFP4 is only fast when the model fits entirely on the GPU? If so, I'll have to wait for a decent model that fits in 12GB VRAM 😅 LMK if you've had a similar experience or if I screwed something else up.

submitted by /u/6c5d1129
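For anyone wanting to sanity-check the "only fast when it fits on the GPU" guess, here is a rough back-of-envelope sketch. It assumes decoding is purely memory-bandwidth bound and that the GPU and CPU portions of each token run one after the other. The function name and every number in it are hypothetical placeholders, not measurements from the setup above, and it does not model kernel differences between quantization formats, so it cannot say by itself why NVFP4 would be slower than Q4_K_XL at the same offload split.

```python
def estimate_decode_tps(active_gb_per_token, gpu_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Rough upper bound on decode tokens/s with weights split across GPU and CPU.

    Assumptions (all simplifications): each token reads `active_gb_per_token`
    gigabytes of weights once, the work is memory-bandwidth bound, the GPU and
    CPU parts run sequentially, and transfer overhead is ignored.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_time = active_gb_per_token * gpu_fraction / gpu_bw_gbs
    cpu_time = active_gb_per_token * (1.0 - gpu_fraction) / cpu_bw_gbs
    return 1.0 / (gpu_time + cpu_time)


# Placeholder inputs: 2 GB of active weights per token, 600 GB/s GPU memory,
# 60 GB/s system RAM. Swap in your own hardware numbers.
all_on_gpu = estimate_decode_tps(2.0, 1.0, 600.0, 60.0)
half_on_cpu = estimate_decode_tps(2.0, 0.5, 600.0, 60.0)
print(f"all on GPU: {all_on_gpu:.1f} tok/s, half on CPU: {half_on_cpu:.1f} tok/s")
```

Under these assumptions the CPU share dominates the per-token time as soon as a meaningful fraction of the weights is offloaded, so any format whose CPU-side path is slower would be hit disproportionately hard.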