Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers
r/LocalLLaMA
I kept seeing inference-speed claims for these models and wanted an apples-to-apples comparison on the hardware I actually have, so I built a harness and a public page that dumps every run as YAML. The dataset: 55 runs across three rigs and five backends (rocm, vulkan, cpu, cuda, __TECH_PRESERVE_6TECH_PRESERVE_5__), with models from 0.35B (LFM2.5) through 35B-A3B (Qwen3.5 MoE). Workloads: short-prompt chat, long-context RAG, long-output codegen, and an agent shape at concurrency 1 and 4. Every run: three measured iterations after one warmup, temperature 0, VRAM fit verified beforehand.
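For anyone who wants to try the same measurement shape at home, here is a minimal sketch of "one warmup, then three measured iterations". This is not the actual harness: `measure`, `generate`, and the injectable `clock` are hypothetical names made up for illustration, and `generate` is assumed to be any callable that runs one request against your backend and returns the number of tokens it produced. Reporting the median is also my assumption, not something the runs above are stated to use.

```python
import statistics
import time
from typing import Callable


def measure(
    generate: Callable[[], int],
    warmup: int = 1,
    iterations: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time `generate` and return per-iteration throughput in tokens/second.

    `generate` is a hypothetical stand-in for one inference request; it must
    return the number of tokens produced by that request.
    """
    # Warmup calls are executed but never timed, so model loading and
    # cache effects do not leak into the measured numbers.
    for _ in range(warmup):
        generate()

    rates = []
    for _ in range(iterations):
        start = clock()
        tokens = generate()
        elapsed = clock() - start
        rates.append(tokens / elapsed)

    return {
        "warmup": warmup,
        "iterations": iterations,
        "tok_per_s": rates,
        "median_tok_per_s": statistics.median(rates),
    }
```

The `clock` parameter exists only so the timing can be swapped for a fake in tests; in normal use you would call `measure(my_request)` and serialize the returned dict however you like.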