Consolidated my homelab from 3 models down to one 122B MoE — benchmarked everything, here's what I found
r/LocalLLaMA
Been running local LLMs on a Strix Halo setup (Ryzen AI MAX+ 395, 128GB RAM, 96 GiB shared GPU memory via Vulkan/RADV) under Proxmox with LXC containers and llama-server. Wanted to share where I landed after way too much benchmarking.

THE OLD SETUP (3 text models)

- GLM-4.7-Flash: 30B MoE 3B active, 18GB, 72 tok/s - daily driver, email
- Qwen3.5-35B-A3B: 35B MoE 3B active, 20GB, 55 tok/s - reasoning/coding
- Qwen3-VL-8B: 8B dense, 6GB, 39 tok/s - vision/cameras

~44GB total. Worked but routing 3 models was annoying.
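
For anyone wondering what "routing 3 models" looks like in practice, here's a rough sketch of the kind of lookup the old setup needs. The model names, sizes, and speeds are the ones listed above; the task labels, the fallback to the daily driver, and the ports are made up for illustration and are not my actual config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    size_gb: int      # on-disk / in-memory size from the list above
    tok_per_s: int    # generation speed from the list above
    base_url: str     # hypothetical llama-server address, not a real config


# One llama-server instance per model (ports are placeholders).
OLD_SETUP = {
    "general": Model("GLM-4.7-Flash", 18, 72, "http://127.0.0.1:8081"),
    "reasoning": Model("Qwen3.5-35B-A3B", 20, 55, "http://127.0.0.1:8082"),
    "vision": Model("Qwen3-VL-8B", 6, 39, "http://127.0.0.1:8083"),
}

# Hypothetical task labels mapped to the role each model played.
TASK_TO_ROLE = {
    "email": "general",
    "chat": "general",
    "coding": "reasoning",
    "reasoning": "reasoning",
    "camera": "vision",
    "vision": "vision",
}


def route(task: str) -> Model:
    """Pick the model for a task; unknown tasks fall back to the daily driver."""
    role = TASK_TO_ROLE.get(task, "general")
    return OLD_SETUP[role]


def total_size_gb(models) -> int:
    """Sum the memory footprint of all loaded models."""
    return sum(m.size_gb for m in models)


if __name__ == "__main__":
    print(route("coding").name)
    print(total_size_gb(OLD_SETUP.values()), "GB loaded")
```

Every caller has to know a task label up front, and every new use case means touching the table. With a single model the whole mapping collapses to one endpoint.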