[llama.cpp] 3.1x Q8_0 speedup on Intel Arc GPUs - reorder optimization fix (PR submitted)

r/LocalLLaMA
Generative AI Open Source AI AI Research

TL;DR: Q8_0 quantization on Intel Xe2 (Battlemage/Arc B-series) GPUs was achieving only 21% of theoretical memory bandwidth. My AI Agent and I found the root cause and submitted a fix that brings it to 66% - a 3.1x speedup in token generation. The problem: On Intel Arc Pro B70, Q8_0 models ran at 4.88 t/s while Q4_K_M ran at 20.56 t/s; a 4x gap that shouldn't exist since Q8_0 only has 1.7x data. After ruling out VRAM pressure, drivers, and backend issues, we traced it to the SYCL kernel dispatch path.