Token/s Qwen3.5-397B-A17B on Vram + Ram pooled
r/LocalLLaMA
•
Generative AI
AI Hardware
Open Source AI
Anyone running Qwen3.5-397B-A17B on a pooled VRAM+RAM setup? What hardware and what speeds are you getting? Trying to get a realistic picture of what this model actually does on a hybrid GPU+system RAM configuration via llama.cpp MoE offloading. Unsloth’s docs claim 25+ tok/s on a single 24GB GPU + 256GB system RAM, but there’s zero info on what CPU or RAM speed that was measured on - which matters a lot since the bottleneck shifts almost entirely to CPU to RAM bandwidth when most of the 214GB Q4 model is sitting in system.