Nemotron 3 Super - large quality difference between llama.cpp and vLLM?

Hey all, I have a private knowledge/reasoning benchmark I like to use for evaluating models. It's a bit over 400 questions, intended for non-thinking modes, and programmatically scored. It seems to correlate quite well with a model's quality, at least for my use cases. Smaller models (24-32B) tend to score ~40%, larger ones (70B dense or somewhat larger MoEs) often score ~50%, and the largest ones I can run (Devstral 2/low quants of GLM 4.5-7) get up to ~60%. When Nemotron 3 Super launched, llama.cpp support didn't seem to be there right away, so I thought I'd try vLLM to run the NVFP4 version.
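
To give a rough idea of what "programmatically scored" can mean here, below is a minimal sketch, not the actual benchmark (which is private). It assumes simple exact-match scoring after normalization, and every name in it (`normalize`, `score`, `QUESTIONS`, `fake_backend`) is hypothetical. The `ask` callable stands in for whatever sends a prompt to a backend such as llama.cpp or vLLM and returns the reply text.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def score(questions: Iterable[Tuple[str, str]], ask: Callable[[str], str]) -> float:
    """Return the fraction of questions whose normalized reply equals the expected answer."""
    total = 0
    correct = 0
    for prompt, expected in questions:
        total += 1
        if normalize(ask(prompt)) == normalize(expected):
            correct += 1
    # Avoid dividing by zero on an empty question set.
    return correct / total if total else 0.0


# Toy data for illustration only; the real question set is not public.
QUESTIONS = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]


def fake_backend(prompt: str) -> str:
    """Stand-in for a real model call; gets one answer right and one wrong."""
    return "  paris " if "France" in prompt else "5"


if __name__ == "__main__":
    print(f"accuracy: {score(QUESTIONS, fake_backend):.0%}")
```

Running the same question set through two `ask` functions, one per backend, gives directly comparable scores, provided the sampling settings and chat template are the same on both sides.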