I’ve been experimenting with MoE inference bottlenecks in llama.cpp, specifically expert movement over PCIe when the model doesn’t fit in VRAM. I implemented a small prototype (~500 LoC) that adds a GPU expert cache + predictive prefetching.

Qwen3-30B-A3B (Q4_K_M) on a 4070 Ti 12GB: 33.74 → 64.45 tok/s (1.91×), 99.5% hit rate

I've been working on a llama.cpp fork called FATE: a small (~500 lines of C++/CUDA) extension that adds a GPU-resident expert cache with cross-layer and temporal predictive prefetching for sparse MoE models.

The idea is simple: in MoE inference, most of the model sits in expert FFN weights, but only a small subset of experts is active for each token. If the model is larger than VRAM, vanilla offloading keeps pulling those expert weights from system RAM over PCIe again and again.
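
To make the caching idea concrete, here is a rough host-side sketch of the bookkeeping an expert cache needs. It is not the FATE code: the `ExpertCache` class, its method names, and the LRU eviction policy are all assumptions for illustration. It tracks only which (layer, expert) pairs are "resident", counts a miss wherever a real implementation would copy weights over PCIe, and leaves the predictive prefetching out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Hypothetical illustration, not FATE's implementation: an LRU set of
// resident experts. A real cache would also own the GPU buffers.
class ExpertCache {
public:
    explicit ExpertCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns true if (layer, expert) is already resident (a hit).
    // On a miss, evicts the least recently used entry if the cache is full
    // and marks the requested expert as resident. The miss is where a real
    // implementation would upload the expert weights over PCIe.
    bool access(std::uint32_t layer, std::uint32_t expert) {
        const std::uint64_t key =
            (static_cast<std::uint64_t>(layer) << 32) | expert;
        auto it = slots_.find(key);
        if (it != slots_.end()) {
            // Move the entry to the front: it is now the most recently used.
            lru_.splice(lru_.begin(), lru_, it->second);
            ++hits_;
            return true;
        }
        if (lru_.size() == capacity_) {
            slots_.erase(lru_.back());  // drop the least recently used expert
            lru_.pop_back();
        }
        lru_.push_front(key);
        slots_[key] = lru_.begin();
        ++misses_;
        return false;
    }

    std::uint64_t misses() const { return misses_; }

    // Fraction of accesses served from the cache so far.
    double hit_rate() const {
        const std::uint64_t total = hits_ + misses_;
        return total == 0 ? 0.0
                          : static_cast<double>(hits_) / static_cast<double>(total);
    }

private:
    std::size_t capacity_;
    std::list<std::uint64_t> lru_;  // front = most recently used
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> slots_;
    std::uint64_t hits_ = 0;
    std::uint64_t misses_ = 0;
};
```

Calling `access(layer, expert)` for every expert the router selects and then reading `hit_rate()` gives the kind of hit-rate figure quoted above. A prefetcher would sit on top of this, inserting predicted experts before the router asks for them, so that more of those accesses land as hits.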