Autoresearch on Qwen3.5-397B, 36 experiments to reach 20.34 tok/s on M5 Max, honest results

I spent the past week trying to push Qwen3.5-397B faster on my M5 Max 128GB. Dan Woods' original baseline was 4.36 tok/s on M3 Max. On M5 Max the starting point was already 10.61 tok/s due to better hardware. My optimizations pushed it to 20.34 tok/s, roughly 2x through software alone, and 4.67x over Dan's original result.

Hardware: MacBook Pro M5 Max, 128GB unified memory, 40-core GPU

Model config: Qwen3.5-397B-A17B, Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS mixed precision), Q8_0 embedding, Q6_K LM head

Decode: 20.34 tok/s
Prefill: 5.52 tok/s
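For anyone who wants to check the speedup figures, here is a small sketch of the arithmetic. It uses only the three decode numbers quoted in this post. The variable and function names are made up for illustration and do not come from any benchmark harness.

```python
# Decode throughput figures quoted in the post (tokens per second).
M3_MAX_BASELINE = 4.36   # Dan Woods' original result on M3 Max
M5_MAX_BASELINE = 10.61  # starting point on M5 Max, before optimizations
M5_MAX_TUNED = 20.34     # final result on M5 Max


def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old`."""
    return new / old


software_gain = speedup(M5_MAX_TUNED, M5_MAX_BASELINE)
hardware_gain = speedup(M5_MAX_BASELINE, M3_MAX_BASELINE)
total_gain = speedup(M5_MAX_TUNED, M3_MAX_BASELINE)

print(f"software only (M5 tuned vs M5 start): {software_gain:.2f}x")
print(f"hardware only (M5 start vs M3 Max):   {hardware_gain:.2f}x")
print(f"total (M5 tuned vs M3 Max):           {total_gain:.2f}x")
```

The software-only ratio comes out at about 1.92x, which is where the "roughly 2x" comes from, and the total is about 4.67x. The hardware-only ratio is simply what is left when you divide one by the other, so treat it as a rough attribution and not as a separately measured number.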