Running Qwen3.5-35B-A3B and Nemotron-3-Super-120B-A12B on an RTX 5060 Ti and a GTX 1080 Ti with llama.cpp (fully on GPU for Qwen; 64GB RAM needed for Nemotron)

r/LocalLLaMA

Setup:
CPU: AMD Ryzen 5 9600X
RAM: 64GB DDR5
GPU1 (host): RTX 5060 Ti 16GB
GPU2 (VM passthrough → RPC): GTX 1080 Ti 11GB
OS: Ubuntu 24.04

Exact models:
unsloth/Qwen3.5-35B-A3B-GGUF (the Q4_K_M quant)
unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF (the UD-Q4_K_M quant)

tl;dr with my setup:
Qwen3.5-35B-A3B Q4_K_M runs at 60 tok/sec
Nemotron-3-Super-120B-A12B UD-Q4_K_M runs at 3 tok/sec

I've had a GTX 1080 Ti for years and years and finally hit a wall with models that need a newer GPU architecture than Pascal, so I decided to upgrade to a 5060 Ti. I went to install the card when I thought.