Fix: Dual Intel Arc GPUs using all system RAM during inference - found the cause and a working fix (llama.cpp SYCL)

r/LocalLLaMA
Generative AI AI Hardware Open Source AI

If you're running dual Intel Arc GPUs with llama.cpp and your system RAM maxes out during multi-GPU inference, even though the model fits in VRAM, this post explains why and how to fix it. I've been running dual Arc Pro B70s (32GB each, 64GB total VRAM) for local LLM inference with llama.cpp's SYCL backend. Every time I tried to split a model across both GPUs, my 64GB of system RAM would climb to 100% and the OOM killer would start taking out desktop processes until the system either crashed or dumped me at the login screen. This happened with every model size.