Multi-GPU? Check your PCI-E lanes! x570, Doubled my prompt proc. speed by switching 'primary' devices, on an asymmetrical x16 / x4 lane setup.

Short version - in my situation, adding export CUDA_VISIBLE_DEVICES="1,0" to my llama.cpp launch script doubled prompt processing speed for me in some situations. Folks, I've been running a dual 3090 setup on a system that splits the PCI-E lanes 16x / 4x between the two "x16" slots (common on x570 boards, I believe). For whatever reason, by default, at least in my setup (Ubuntu-Server 24.04 Nvidia 580.126.20 drivers, x570 board), the CUDA0 device is the one on the 4-lane PCI express slot.