Passthrough PCI GPU device fails on reboot

Hello.

I have a host with an NVIDIA RTX 3090. I configured PCI passthrough and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.

The problem sometimes appears when rebooting the virtual machine. It doesn't happen every time, but eventually, after 3 or 4 reboots, the PCI device stops working. The only solution is to reboot the host.

The weird thing is that this only happens when rebooting the VM. After a host reboot, if we shut down the virtual machine and start it again, it works fine. I wrote a small script that does that a hundred times just to make sure. Only a reboot triggers the problem.

When it fails I run "nvidia-smi" in the virtual machine and I get:

    No devices were found

I also spotted some errors in syslog:

    NVRM: installed in this system is not supported by the NVIDIA 460.91.03 driver release.
    NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
    NVRM: the NVIDIA kernel module is unloaded.
    NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
    NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0

The device is still there, because lspci shows it:

    0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

I tried different Nvidia drivers and Linux kernels in the host and the virtual machine, with the same results.

I wonder if some process is keeping hold of the PCI device, but everything I tried failed. I made sure the virtual machine was down, then started it again. I also tried restarting libvirtd.

I was thinking about a hardware problem, but we have 3 different GPUs of the same model and it happens with all of them.

I know this is hard, but I wonder if anyone here has any idea of what I can do to fix this.

Thank you for your time.
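When running a reboot loop like the one described above, it can help to scan the guest log automatically for the failure signatures instead of reading syslog by hand. The following is only a minimal sketch: the two patterns are copied from the NVRM messages quoted in this post, while the function name, the labels, and the idea of feeding it lines read from syslog are assumptions made for illustration.

```python
import re

# Patterns copied from the NVRM messages quoted in the post.
# The dictionary keys are hypothetical labels, not NVIDIA terminology.
NVRM_FAILURE_PATTERNS = {
    "fallen_off_bus": re.compile(r"NVRM: GPU (\S+): GPU has fallen off the bus"),
    "init_failed": re.compile(r"NVRM: GPU (\S+): RmInitAdapter failed!"),
}


def find_nvrm_failures(lines):
    """Return (label, pci_address) pairs for every matching log line."""
    hits = []
    for line in lines:
        for label, pattern in NVRM_FAILURE_PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((label, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = [
        "NVRM: GPU 0000:01:01.0: GPU has fallen off the bus",
        "NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)",
    ]
    for label, address in find_nvrm_failures(sample):
        print(label, address)
```

A reboot test could call find_nvrm_failures() on the new syslog lines after each guest boot and stop at the first hit, which would give the number of reboots needed to trigger the failure.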

On Tue, Jul 27, 2021 at 11:22:25AM +0200, Francesc Guasch wrote:
> Hello.
>
> I have a host with an NVIDIA RTX 3090. I configured PCI passthrough and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.
>
> The problem sometimes appears when rebooting the virtual machine. It doesn't happen every time, but eventually, after 3 or 4 reboots, the PCI device stops working. The only solution is to reboot the host.
>
> The weird thing is that this only happens when rebooting the VM. After a host reboot, if we shut down the virtual machine and start it again, it works fine. I wrote a small script that does that a hundred times just to make sure. Only a reboot triggers the problem.
>
> When it fails I run "nvidia-smi" in the virtual machine and I get:
>
>     No devices were found
>
> I also spotted some errors in syslog:
>
>     NVRM: installed in this system is not supported by the NVIDIA 460.91.03 driver release.
>     NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
>     NVRM: the NVIDIA kernel module is unloaded.
>     NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
>     NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
>
> The device is still there, because lspci shows it:
>
>     0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
>         Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
>         Kernel driver in use: nvidia
>         Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
>
> I tried different Nvidia drivers and Linux kernels in the host and the virtual machine, with the same results.

Hi,

this question is better suited for vfio-users@redhat.com. Once the GPU is bound to the vfio-pci driver, it's out of libvirt's hands.

AFAIR Nvidia only enabled PCI device assignment on GeForce cards for Windows 10 VMs, but you say you are running a Linux VM. Back when I worked on the vGPU stuff, which is supported only on the Tesla cards, I remember being told that the host and guest drivers communicate with each other. Applying the same to GeForce, I would not be surprised if NVIDIA detected in the host driver that the corresponding guest driver is not a Windows 10 one and didn't do a proper GPU reset between VM reboots, hence the need to reboot the host.

There used to be a similar bus reset bug in the AMD host driver not so long ago. It affected every single VM shutdown/reboot, and the host had to be rebooted before the card was usable again.

Be that as it may, I can only speculate, and since your scenario is officially not supported by NVIDIA, I wish you the best of luck :)

Regards,
Erik
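To check on the host which driver currently holds the GPU, and whether the kernel exposes a reset method for it, the standard PCI sysfs entries can be read directly. This is a minimal sketch under stated assumptions: the helper name and the host PCI address in the example are hypothetical (the 0000:01:01.0 address quoted in the thread is the guest-side address), and the presence of the "reset" file only means the kernel knows some reset method for the device, not that the reset actually restores this particular GPU.

```python
import os
from pathlib import Path


def pci_device_status(address, sysfs_root="/sys/bus/pci/devices"):
    """Report the bound driver and reset support of a PCI device.

    `address` is a full PCI address such as "0000:01:00.0".
    `sysfs_root` is a parameter only so the helper can be pointed at a
    fake directory tree for testing.
    """
    device_dir = Path(sysfs_root) / address
    driver_link = device_dir / "driver"

    # The "driver" entry is a symlink to the driver's directory;
    # it is absent when no driver is bound to the device.
    driver = None
    if driver_link.is_symlink():
        driver = os.path.basename(os.readlink(driver_link))

    return {
        "driver": driver,
        # The "reset" attribute exists only if the kernel has a reset
        # method for this device.
        "has_reset": (device_dir / "reset").exists(),
    }


if __name__ == "__main__":
    # Hypothetical host address; replace it with the GPU's real address.
    print(pci_device_status("0000:01:00.0"))
```

While the VM is running with the card assigned, the driver would be expected to read "vfio-pci". A value of None means no driver is bound.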

On 27/7/21 at 12:29, Erik Skultety wrote:
> On Tue, Jul 27, 2021 at 11:22:25AM +0200, Francesc Guasch wrote:
>> Hello.
>>
>> I have a host with an NVIDIA RTX 3090. I configured PCI passthrough and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.
>>
>> The problem sometimes appears when rebooting the virtual machine. It doesn't happen every time, but eventually, after 3 or 4 reboots, the PCI device stops working. The only solution is to reboot the host.

While reviewing the system I noticed that, although I had run dist-upgrade, the host didn't have the latest kernel. Upgrading from kernel 5.8 fixed the problem. So it seems it was a vfio problem, as Erik pointed out.

Thanks!
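Since a dist-upgrade can leave a host running an older kernel than expected, a quick check of the running release can catch this. The sketch below is an assumption-laden illustration: the thread only says that upgrading from kernel 5.8 fixed the problem and does not name the fixed version, so the "5.8 or older" threshold and both function names are hypothetical.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a release string like '5.8.0-63-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def still_on_old_kernel(release, threshold=(5, 8)):
    """True if the kernel is at or below the threshold.

    The (5, 8) default is an assumption based on the thread, which
    reports the problem on 5.8 but does not say which version fixed it.
    """
    return kernel_version(release) <= threshold


if __name__ == "__main__":
    running = platform.release()
    print(running, "old kernel" if still_on_old_kernel(running) else "newer kernel")
```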