Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK

TL;DR On 4× RTX 3090 with NVLink bonded between GPU pairs (0↔2 and 1↔3), pinning TP=2 to a NVLinked pair gave +25% throughput at concurrency 1 and +53% at concurrency 4 vs running TP=2 over PCIe. Adding the other two GPUs to make it TP=4 made things worse, not better. Setup Hardware: 4× RTX 3090 (24 GB), NVLink (NV4) between GPU0↔GPU2 and GPU1↔GPU3. Cross-pair traffic goes via PCIe Host Bridge (PHB). Bash $ nvidia-smi topo -m GPU0 GPU1 GPU2 GPU3 GPU0 X PHB NV4 PHB GPU1 PHB X PHB NV4 GPU2 NV4 PHB X PHB GPU3 PHB NV4 PHB X Software: vLLM 0.20.1, transformers 5.7.0, CUDA 12.8.