Anyone else fighting Blackwell GSP timeouts in production passthrough? How are you handling recovery without a host reboot?

r/LocalLLaMA
AI Hardware

Environment:

GPU: NVIDIA RTX Pro 5000 (Blackwell architecture, PCI ID: 10de:2bb3)
Host OS: Linux (KVM/QEMU hypervisor)
Guest OS: Ubuntu 24.04 LTS
Driver version: 580.105.08 (open kernel module, MIT/GPL flavor)

The problem: When passing through the RTX Pro 5000 (Blackwell) to an Ubuntu VM via VFIO, the GSP firmware occasionally hits a heartbeat timeout during initialization or driver reload.
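
For anyone comparing notes on the host side, here is a minimal sketch (not a fix) that lists PCI functions matching the ID above, shows which driver each is bound to, and checks whether the kernel exposes a per-device `reset` file in sysfs. The function name `find_gpus` is hypothetical. The sysfs layout is the standard Linux one, but whether a sysfs-triggered reset actually recovers a hung GSP on this card is an open question, not something this snippet demonstrates.

```python
import os
from pathlib import Path

# PCI ID from the environment above: vendor 10de, device 2bb3.
# sysfs stores these with a 0x prefix.
VENDOR_ID = "0x10de"
DEVICE_ID = "0x2bb3"


def find_gpus(sysfs_root="/sys/bus/pci/devices", vendor=VENDOR_ID, device=DEVICE_ID):
    """Return one dict per PCI function under sysfs_root matching vendor/device.

    find_gpus is a hypothetical helper for illustration only.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # not Linux, or sysfs is not mounted here

    found = []
    for dev in sorted(root.iterdir()):
        try:
            if (dev / "vendor").read_text().strip() != vendor:
                continue
            if (dev / "device").read_text().strip() != device:
                continue
        except OSError:
            continue  # entry vanished or is unreadable; skip it

        # When a driver is bound, "driver" is a symlink whose last path
        # component is the driver name (e.g. vfio-pci).
        driver_link = dev / "driver"
        driver = None
        if driver_link.is_symlink():
            driver = os.path.basename(os.readlink(driver_link))

        found.append({
            "bdf": dev.name,  # e.g. 0000:01:00.0
            "driver": driver,
            # The reset file is only present if the kernel has a reset
            # method for this function.
            "has_reset": (dev / "reset").exists(),
        })
    return found


if __name__ == "__main__":
    for gpu in find_gpus():
        print(gpu)
```

Run on the host, this prints one line per matching function. If `driver` is not `vfio-pci` or `has_reset` is `False`, that is worth mentioning when you reply, since it narrows down which recovery paths are even available.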