llama.cpp: Prefetching weights when offloading to CPU

r/LocalLLaMA

Hello r/LocalLLaMA, I put up an experimental PR that prefetches weights when offloading to CPU. Long story short, judging from the results, it helps dense models and smaller MoE models. Give it a try if you are RAM-rich and GPU-poor like me.

Submitted by /u/am17an
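For anyone unfamiliar with the general idea, here is a minimal sketch of layer-wise prefetching: while layer i is computing, the weights for layer i+1 are already being fetched in the background. This is not the PR's code and makes no claim about how llama.cpp implements it. The names `load_weights`, `apply_layer`, and `run_with_prefetch` are hypothetical, and the "transfer" and "compute" steps are toy stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def load_weights(host_weights, i):
    # Hypothetical stand-in for copying layer i's weights from host RAM
    # to wherever the compute happens; here it only copies a list.
    return list(host_weights[i])


def apply_layer(x, w):
    # Hypothetical stand-in for a layer's compute: adds the sum of the weights.
    return x + sum(w)


def run_sequential(host_weights, x):
    # Baseline: each layer waits for its own weights before computing.
    for i in range(len(host_weights)):
        x = apply_layer(x, load_weights(host_weights, i))
    return x


def run_with_prefetch(host_weights, x):
    # One background worker fetches the next layer's weights
    # while the current layer is being computed.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_weights, host_weights, 0)
        for i in range(len(host_weights)):
            w = pending.result()  # block until layer i's weights are ready
            if i + 1 < len(host_weights):
                # Start fetching layer i+1 before computing layer i.
                pending = pool.submit(load_weights, host_weights, i + 1)
            x = apply_layer(x, w)
    return x
```

Both functions produce the same output. The prefetching version only changes when the loads are issued, so any benefit depends on the transfer actually overlapping with compute.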