Why DDR5 Bandwidth Kills Dual-LLM Inference on APUs (Benchmarks Inside)

Dev.to AI
Generative AI

Did you know that a 35-billion-parameter model can generate tokens at the same compute cost as a 4B model? That single fact made me abandon a multi-model agent architecture I'd spent a weekend building. But I had to run the benchmarks first to understand why. Here's the full breakdown, with commands, numbers, and the architectural reason it all falls apart on shared-memory hardware. The Discovery That Changed Everything I'd been running qwen3.6:35b on my Minisforum UM790Pro for weeks as my daily coding assistant. 17.8 tokens/second -- genuinely usable for interactive work.