(Llama.cpp) In case people are struggling with prompt processing on larger models like Qwen 27B, here's what helped me out
r/LocalLLaMA
TLDR: I set --ubatch-size to the size of my GPU's L3 cache (in MB).

I was playing around with that value and had a hard time finding out what exactly it does. Or rather, I couldn't really understand it from most of the sources I found, and asking AI chats for help gave very mixed results. My GPU is an RX 9070 XT, and when I set --ubatch-size 64 (the GPU has 64 MB of L3 cache), my prompt processing speed jumped to the point where it was actually usable for Claude Code invocations. I understand there might well be some resources detailing and explaining this on the web, or in the docs.
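If you want to check whether this holds on your own card, one rough way is to sweep a few micro-batch sizes with llama-bench and compare prompt processing throughput. The sketch below is not from the original post: the model path `model.gguf`, the 2048-token prompt length, and the list of sizes are placeholder assumptions. As far as I know, `-ub` / `--ubatch-size` is counted in tokens (the physical batch size, default 512), so matching it to the cache size in MB is a heuristic to measure, not a rule.

```shell
#!/bin/sh
# Hypothetical model path; point this at your own GGUF file.
MODEL="${MODEL:-model.gguf}"

# Micro-batch sizes to compare (assumed list). 64 is the value from the post,
# 512 is the llama.cpp default.
UBATCH_SIZES="${UBATCH_SIZES:-32,64,128,256,512}"

# -p 2048 : benchmark prompt processing on a 2048-token prompt (assumed length)
# -n 0    : skip the token generation test
# -ub     : comma-separated micro-batch sizes, one benchmark run per value
set -- llama-bench -m "$MODEL" -p 2048 -n 0 -ub "$UBATCH_SIZES"

if [ "${RUN:-0}" = "1" ]; then
  # Actually run the benchmark (requires llama-bench on PATH).
  "$@"
else
  # Dry run: just print the command that would be executed.
  echo "$*"
fi
```

Run it with `RUN=1 MODEL=/path/to/your/model.gguf sh sweep.sh` (the script name is up to you) and look at the tokens-per-second column for the prompt test. Whichever size wins can then be passed to the server, for example `llama-server -m model.gguf --ubatch-size 64`.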