Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models

r/LocalLLaMA

Bigger ubatch made gpt-oss-120b prompt processing much faster on my RTX 3090

I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can massively improve prompt processing throughput, as long as you also raise --n-cpu-moe enough to keep the run inside VRAM. The llama.cpp defaults are -b 2048 and -ub 512; I included that default run as its own point in the chart.
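
To make the tuning loop concrete, here is a rough Python sketch that prints one llama-server command line per (-ub, --n-cpu-moe) pair. The pairs in CANDIDATES and the -ngl 99 value are placeholders I made up for illustration, not measured settings, and build_cmd is a hypothetical helper. It also raises -b to at least -ub, on the assumption that llama.cpp caps the physical micro-batch at the logical batch size.

```python
import shlex

MODEL = "gpt-oss-120b-F16.gguf"

def build_cmd(ubatch, n_cpu_moe, batch=2048, model=MODEL):
    """Build a llama-server command line for one (-ub, --n-cpu-moe) pair."""
    # Assumption: -ub is capped at -b, so lift -b whenever -ub exceeds it.
    batch = max(batch, ubatch)
    args = [
        "llama-server",
        "-m", model,
        "-ngl", "99",                    # placeholder: offload all layers to the GPU
        "--n-cpu-moe", str(n_cpu_moe),   # keep this many layers' MoE experts on the CPU
        "-b", str(batch),                # logical batch size
        "-ub", str(ubatch),              # physical micro-batch size
    ]
    return shlex.join(args)

# Hypothetical pairs: a larger -ub needs more VRAM for compute buffers,
# so --n-cpu-moe goes up with it. Find real values by testing on your own card.
CANDIDATES = [(512, 24), (2048, 26), (4096, 28)]

for ub, ncmoe in CANDIDATES:
    print(build_cmd(ub, ncmoe))
```

Run each printed command, watch VRAM usage, and bump --n-cpu-moe whenever a larger -ub fails to fit.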