On Tue, Jul 27, 2021 at 11:22:25AM +0200, Francesc Guasch wrote:
Hello.
I have a host with an NVIDIA RTX 3090. I configured PCI passthrough
and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.
The problem sometimes comes when rebooting the virtual machine. It doesn't
happen 100% of the time, but eventually, after 3 or 4 reboots, the PCI
device stops working. The only solution is to reboot the host.
The weird thing is this only happens when rebooting the VM. After a host
reboot, if we shut down the virtual machine and start it again, it works
fine. I wrote a small script that does that a hundred times just to make
sure; only a reboot triggers the problem.
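For reference, the script is essentially just the loop below (the domain
name and the guest hostname are ours, adjust to your setup):

    #!/bin/bash
    # stop and start the guest a hundred times, checking the GPU each time
    for i in $(seq 1 100); do
        virsh shutdown cuda-vm
        # wait until the domain has really shut off
        while virsh domstate cuda-vm | grep -q running; do
            sleep 5
        done
        virsh start cuda-vm
        sleep 60   # give the guest time to boot
        # check the GPU from inside the guest
        ssh cuda-vm nvidia-smi || { echo "GPU failed on iteration $i"; exit 1; }
    done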
When it fails, I run "nvidia-smi" in the virtual machine and I get:
No devices were found
I also spotted some errors in syslog:
NVRM: installed in this system is not supported by the
NVIDIA 460.91.03 driver release.
NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
NVRM: the NVIDIA kernel module is unloaded.
NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
The device is still there, because when I type lspci I can see its information:
0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation
Device [10de:2204] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
I tried different NVIDIA drivers and Linux kernels on both the host and
the virtual machine, with the same results.
Hi,
this question is better suited for vfio-users@redhat.com. Once the GPU is bound
to the vfio-pci driver, it's out of libvirt's hands.
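(You can double-check on the host which driver currently owns the card with
something along these lines; the address is illustrative, on the host the
card will not sit at the guest's 01:01.0:

    # show which kernel driver is bound to the GPU on the host
    lspci -nnks 0000:01:00.0
    # or ask sysfs directly
    readlink /sys/bus/pci/devices/0000:01:00.0/driver

While the VM is running with the device assigned, that should report
vfio-pci.)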
AFAIR NVIDIA only enabled PCI device assignment of GeForce cards in Windows 10
VMs, but you say you are running a Linux VM. Back when I worked on the vGPU
stuff, which is supported only on the Tesla cards, I remember being told that
the host and guest drivers communicate with each other. Applying the same to
GeForce, I would not be surprised if the NVIDIA host driver detected that the
corresponding guest driver is not a Windows 10 one and skipped a proper GPU
reset in between VM reboots - hence the need to reboot the host. There was a
similar bus reset bug in the AMD host driver not so long ago which affected
every single VM shutdown/reboot, to the point that the host had to be rebooted
for the card to be usable again. Be that as it may, I can only speculate, and
since your scenario is officially not supported by NVIDIA I wish you the best
of luck :)
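If you want to experiment before the next forced host reboot, you could try
making the kernel re-probe the card from the host side while the VM is shut
off - no guarantees it helps, and the PCI address below is just a placeholder
for whatever address the card has on the host:

    # as root on the host: drop the device from the PCI tree and rescan,
    # which forces it to be re-discovered and re-probed
    echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
    echo 1 > /sys/bus/pci/rescan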
Regards,
Erik