Update on Qwen 3.5 35B A3B on Raspberry Pi 5

r/LocalLLaMA

Did some work on my Raspberry Pi inference setup.

- Modified llama.cpp (a mix of the OG repo, ik_llama, and some tweaks)
- Experimented with different quants, params, etc.
- Prompt caching (ik_llama has some issues on ARM, so it’s not 100% tweaked yet, but I’m getting there)

The above is running this specific quant:

Some numbers for what to expect now (all tests on 16k context, vision encoder enabled):

- 2-bit big-ish quants of Qwen3.5 35B A3B: 3.5 t/s on the 16GB Pi, 2.5-ish t/s on the SSD-enabled 8GB Pi.
- Prompt processing is around 50s per 1k tokens.
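To put those numbers in perspective, here is a rough back-of-the-envelope calculator for how long a request might take. It is only a sketch: it assumes prompt processing scales linearly with prompt length (it probably doesn't exactly, especially near the 16k limit) and that nothing is served from the prompt cache. The function name and parameters are made up for illustration; the defaults are the 16GB Pi figures quoted above.

```python
def estimate_seconds(prompt_tokens, output_tokens,
                     pp_seconds_per_1k=50.0, gen_tokens_per_s=3.5):
    """Rough wall-clock estimate for one request.

    Assumptions (not measured): prompt processing is linear in token
    count and no part of the prompt is already cached.
    """
    prompt_time = prompt_tokens / 1000 * pp_seconds_per_1k
    gen_time = output_tokens / gen_tokens_per_s
    return prompt_time + gen_time


if __name__ == "__main__":
    # 2k-token prompt, 350-token answer on the 16GB Pi
    print(f"16GB Pi: {estimate_seconds(2000, 350):.0f} s")
    # Same request on the SSD-enabled 8GB Pi (2.5 t/s generation)
    print(f"8GB Pi:  {estimate_seconds(2000, 350, gen_tokens_per_s=2.5):.0f} s")
```

So under these assumptions a 2k-token prompt costs well over a minute before the first token shows up, which is why getting prompt caching working properly matters so much on this hardware.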