Qwen 27b and Other Dense Models Optimization

Hi All, I hadn't realized the k cache quant made such a big difference, so I took my 64 gig mac M2 Max Studio and switched from Qwen 3.5 35b a3b to the dense 27b. I love it, it's a huge difference, but I get maybe 3 tokens a second. I have k cache at q8, offload to gpu, flash attention, mmap, max concurrent 4, eval batch 2048, cpu set to 8, gpu offload full (64). I'm on LM Studios and run everything through Openclaw. Just wondering if there's anything I can do to speed it up.