I just ran Qwen3.5 35B on my iPhone at 5.6 tok/sec.
r/LocalLLaMA
submitted by /u/Alexintosh

Fully on-device at 4-bit with 256 experts. It streams the MoE experts from SSD to the GPU. I saw the article from Dan Woods and decided to port the Metal inference engine to iOS, add a few optimizations, and build a basic app. I'm currently generating the weights for the 379B model and will have that running next.
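For anyone wondering what "streaming experts from SSD" can look like in principle, here is a minimal Python sketch. It is not the actual Metal engine from the post. It assumes each expert is stored as a fixed-size blob in one file, and that a router (not shown) picks a few expert IDs per token. The names `ExpertStreamer`, `write_dummy_experts`, and `demo`, the blob size, the cache size, and the routed IDs are all hypothetical, chosen only for illustration. The idea is that only the routed experts are read from disk, and a small LRU cache keeps recently used ones in memory.

```python
import os
import tempfile
from collections import OrderedDict


class ExpertStreamer:
    """Reads fixed-size expert blobs from a file on demand, with a small LRU cache."""

    def __init__(self, path, expert_bytes, cache_size):
        self.path = path
        self.expert_bytes = expert_bytes
        self.cache_size = cache_size
        self.cache = OrderedDict()  # expert_id -> bytes, least recently used first
        self.disk_reads = 0

    def load(self, expert_id):
        # Cache hit: mark as most recently used and skip the disk.
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Cache miss: seek to the expert's offset and read only its bytes.
        with open(self.path, "rb") as f:
            f.seek(expert_id * self.expert_bytes)
            blob = f.read(self.expert_bytes)
        self.disk_reads += 1
        self.cache[expert_id] = blob
        # Evict the least recently used expert if the cache is over capacity.
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)
        return blob


def write_dummy_experts(path, num_experts, expert_bytes):
    # Stand-in for real quantized weights: expert i is filled with byte value i % 256.
    with open(path, "wb") as f:
        for i in range(num_experts):
            f.write(bytes([i % 256]) * expert_bytes)


def demo():
    num_experts, expert_bytes = 256, 1024  # 256 experts as in the post; blob size is made up
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "experts.bin")
        write_dummy_experts(path, num_experts, expert_bytes)
        streamer = ExpertStreamer(path, expert_bytes, cache_size=4)
        # Hypothetical router output: two expert IDs per token, for three tokens.
        for routed in ([3, 17], [3, 200], [17, 42]):
            for expert_id in routed:
                blob = streamer.load(expert_id)
                assert blob[0] == expert_id % 256
        return streamer.disk_reads


if __name__ == "__main__":
    print("disk reads:", demo())
```

Six expert lookups in the demo cost only four disk reads, because experts 3 and 17 are reused while still cached. A real engine would presumably read into GPU-visible buffers instead of Python bytes, but the cache-or-read pattern is the same.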