Vulkan backend is much easier on the CPU and GPU memory than CUDA.
r/LocalLLaMA
I'm on Linux and compiled my own llama.cpp with CUDA. top would always show one CPU core pegged at 100% when running Qwen3.5-9B-GGUF:Q4_K_M on my potato-like RTX A2000 12GB, and nvidia-smi would show 11GB+ of GPU memory in use. Speed is ~30 tokens per second. My system fans would spin up whenever that single core got pegged, which was annoying to listen to. I decided to compile llama.cpp again with the Vulkan backend to see if anything would be different.
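If you want to compare GPU memory between the two builds with something more repeatable than eyeballing nvidia-smi, here is a rough Python sketch. It assumes nvidia-smi is on your PATH and uses its CSV query mode; the helper names `parse_gpu_memory` and `read_gpu_memory` are made up for this example and are not part of llama.cpp or any NVIDIA tooling.

```python
import subprocess

# nvidia-smi's CSV query mode; with "nounits" the values are plain MiB numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn lines like '1000, 12288' into (used_mib, total_mib) tuples, one per GPU.

    The numbers in this docstring are placeholders, not measurements.
    """
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed readings (hypothetical helper)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Call `read_gpu_memory()` while the model is loaded under each build and compare the used values. CPU load still has to come from top or a similar tool; this only covers the GPU memory side.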