Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

I've been benchmarking a heterogeneous 7-GPU cluster to see how different inference engines handle long-context prefill with pipeline parallelism. The setup is a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and three modded 4090 48GB. All tests used 4-bit weights: NVFP4 for vLLM and SGLang, and MXFP4 for llama.cpp. The main takeaway is that vLLM significantly outperforms the other two at long-context prefill on this mixed multi-GPU setup.
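To give a sense of what "heterogeneous" means for pipeline parallelism here, below is a rough Python sketch that sums the VRAM of the seven cards and splits a model's layers across pipeline stages in proportion to each card's memory. This is only an illustration: the 64-layer model and the `split_layers` helper are hypothetical, and none of the three engines necessarily partitions layers this way.

```python
# VRAM per GPU in GB, taken from the setup described in the post.
GPUS = [
    ("RTX PRO 6000", 96),
    ("PRO 5000", 48),
    ("5090", 32),
    ("5090", 32),
    ("4090 (modded)", 48),
    ("4090 (modded)", 48),
    ("4090 (modded)", 48),
]


def split_layers(num_layers, vram_gb):
    """Hypothetical helper: assign layers to stages in proportion to VRAM.

    Uses largest-remainder rounding so the counts always add up to num_layers.
    """
    total = sum(vram_gb)
    exact = [num_layers * v / total for v in vram_gb]
    counts = [int(x) for x in exact]  # round down first
    leftover = num_layers - sum(counts)
    # Give the remaining layers to the stages with the largest fractional parts.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


total_vram = sum(v for _, v in GPUS)  # 352 GB across all seven cards

NUM_LAYERS = 64  # assumption for illustration, not the model used in the tests
layer_counts = split_layers(NUM_LAYERS, [v for _, v in GPUS])

for (name, vram), n in zip(GPUS, layer_counts):
    print(f"{name:>14} {vram:>3} GB -> {n} layers")
print(f"total VRAM: {total_vram} GB")
```

A memory-proportional split like this ignores the compute gap between the Blackwell and Ada cards, so the slowest stage can still bottleneck prefill even when every card's VRAM is filled evenly.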