
On Tue, Jul 27, 2021 at 11:22:25AM +0200, Francesc Guasch wrote:
Hello.
I have a host with an NVIDIA RTX 3090. I configured PCI passthrough and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.
The problem sometimes appears when rebooting the virtual machine. It doesn't happen every time, but after 3 or 4 reboots the PCI device stops working. The only solution is to reboot the host.
The weird thing is that this only happens when rebooting the VM. After a host reboot, if we shut the virtual machine down and start it again, it works fine. I wrote a small script that does that a hundred times just to make sure. Only a reboot triggers the problem.
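The script is nothing fancy, roughly this (the domain name below is just a placeholder for ours):

  #!/bin/bash
  # Shut the guest down and start it again 100 times to confirm that a
  # clean shutdown/start cycle never breaks the passed-through GPU.
  DOMAIN=cuda-vm                  # placeholder, our real domain name differs

  for i in $(seq 1 100); do
      virsh shutdown "$DOMAIN"
      # wait until the guest is really powered off
      while [ "$(virsh domstate "$DOMAIN")" != "shut off" ]; do
          sleep 5
      done
      virsh start "$DOMAIN"
      sleep 60                    # give the guest time to boot and load the driver
  done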
When it fails, I run "nvidia-smi" in the virtual machine and I get:
No devices were found
I also spotted some errors in syslog:
NVRM: installed in this system is not supported by the NVIDIA 460.91.03 driver release.
NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
NVRM: the NVIDIA kernel module is unloaded.
NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
The device is still there; running lspci I can see its information:
0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2204] (rev a1)
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
I tried different NVIDIA drivers and Linux kernels in both the host and the virtual machine, with the same results.
Hi,

this question is better suited for vfio-users@redhat.com. Once the GPU is bound to the vfio-pci driver, it's out of libvirt's hands.

AFAIR NVIDIA only enabled PCI device assignment of GeForce cards for Windows 10 VMs, but you say you are running a Linux VM. Back when I worked on the vGPU stuff (which is supported only on the Tesla cards), I remember being told that the host and guest drivers communicate with each other. Applying the same to GeForce, I would not be surprised if the NVIDIA host driver detected that the guest driver is not a Windows 10 one and didn't do a proper GPU reset between VM reboots, hence the need to reboot the host. There was a similar bus reset bug in the AMD host driver not so long ago which affected every single VM shutdown/reboot, so the host had to be rebooted for the card to become usable again.

Be that as it may, I can only speculate, and since your scenario is officially not supported by NVIDIA I wish you the best of luck :)

Regards,
Erik
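P.S. If you want to experiment before rebooting the whole host, you could check whether the card advertises a function level reset and ask the kernel to reset it from the host side while the guest is shut off. This is only an untested sketch; substitute the card's host-side PCI address (it is not the guest's 0000:01:01.0), the reset attribute only exists if the kernel found a usable reset method, and I can't promise the GeForce firmware recovers from it:

  # on the host, with the VM shut off
  HOST_BDF=0000:XX:00.0          # placeholder, use the card's host address

  # does the function advertise Function Level Reset (FLReset+)?
  sudo lspci -vv -s "$HOST_BDF" | grep -i flreset

  # ask the kernel to reset the function, if a reset method is available
  echo 1 | sudo tee /sys/bus/pci/devices/$HOST_BDF/reset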