[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE

| Model | Size | Single 5090 (t/s) | Dual 5090 RPC (t/s) | Note |
|---|---|---|---|---|
| Qwen3.5-27B (Q6_K) | 20.9 GB | 59.83 | 55.41 | -7% Overhead |
| Qwen3.5-35B MoE (Q6_K) | 26.8 GB | 206.76 | 150.99 | Interconnect Bottleneck |
| Qwen2.5-32B (Q6_K) | 25.0 GB | 54.69 | 51.47 | Stable Scaling |
| Qwen2.5-72B (Q4_K_M) | 40.9 GB | FAILED (OOM) | 32.74 | Now Playable! |
| Qwen3.5-122B MoE (IQ4_XS) | 56.1 GB | FAILED (OOM) | 96.29 | Beast Mode ON |

The Setup

I recently tested the distributed inference capabilities of llama.cpp RPC using two identical workstations. This setup allows pooling VRAM (64GB total) to run models that are physically impossible to fit on a single 32GB card.
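For anyone who wants to sanity-check the numbers, here is a minimal Python sketch that recomputes the RPC slowdown for each model and checks which ones fit in 32GB versus the pooled 64GB. The figures are copied from the benchmark table; the names `RESULTS`, `rpc_change_pct` and `fits` are made up for this illustration, and the fit check is a simplifying assumption that compares only the weight file size against VRAM (it ignores KV cache and compute buffers, so real headroom is smaller).

```python
# Figures copied from the benchmark table: (model, size_gb, single_tps, rpc_tps).
# single_tps is None where the single-card run failed with OOM.
RESULTS = [
    ("Qwen3.5-27B (Q6_K)", 20.9, 59.83, 55.41),
    ("Qwen3.5-35B MoE (Q6_K)", 26.8, 206.76, 150.99),
    ("Qwen2.5-32B (Q6_K)", 25.0, 54.69, 51.47),
    ("Qwen2.5-72B (Q4_K_M)", 40.9, None, 32.74),
    ("Qwen3.5-122B MoE (IQ4_XS)", 56.1, None, 96.29),
]

SINGLE_VRAM_GB = 32.0  # one RTX 5090
POOLED_VRAM_GB = 64.0  # two cards pooled over RPC


def rpc_change_pct(single_tps, rpc_tps):
    """Relative throughput change of the RPC run vs. the single-card run, in percent."""
    return (rpc_tps - single_tps) / single_tps * 100.0


def fits(size_gb, vram_gb):
    """Assumption: weights-only check; KV cache and buffers are not counted."""
    return size_gb < vram_gb


for model, size_gb, single_tps, rpc_tps in RESULTS:
    if single_tps is None:
        change = "n/a (single card OOM)"
    else:
        change = f"{rpc_change_pct(single_tps, rpc_tps):+.1f}%"
    print(
        f"{model}: fits 32GB={fits(size_gb, SINGLE_VRAM_GB)}, "
        f"fits 64GB={fits(size_gb, POOLED_VRAM_GB)}, RPC vs single: {change}"
    )
```

The 27B row comes out at about -7.4%, which matches the "-7% Overhead" note, and the two models that hit OOM on a single card are exactly the ones whose weights exceed 32GB.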