Hello.
I have a host with an NVIDIA RTX 3090. I configured PCI passthrough
and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.
The problem sometimes appears when rebooting the virtual machine. It
doesn't happen 100% of the time, but eventually, after 3 or 4 reboots,
the PCI device stops working. The only solution is to reboot the host.
The weird thing is that this only happens when rebooting the VM. After a
host reboot, if we shut down the virtual machine and start it again,
it works fine. I wrote a small script that does that a hundred times
just to make sure. Only a reboot triggers the problem.
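For anyone who wants to reproduce the shutdown/start test, here is a minimal sketch of such a loop. It is not the exact script. It assumes the guest is a libvirt domain controlled through virsh, and the domain name "gpu-vm" and the timeouts are placeholders.

```python
import subprocess
import time

DOMAIN = "gpu-vm"  # hypothetical libvirt domain name, replace with your own


def virsh(*args, runner=subprocess.run):
    """Run a virsh subcommand and return its stripped stdout."""
    result = runner(["virsh", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def wait_for_state(domain, wanted, runner=subprocess.run, timeout=120, poll=2):
    """Poll 'virsh domstate' until the domain reports the wanted state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if virsh("domstate", domain, runner=runner) == wanted:
            return True
        time.sleep(poll)
    return False


def cycle(domain, runner=subprocess.run):
    """One full shutdown followed by a fresh start (not a guest reboot)."""
    virsh("shutdown", domain, runner=runner)
    if not wait_for_state(domain, "shut off", runner=runner):
        raise RuntimeError("guest did not shut down in time")
    virsh("start", domain, runner=runner)
    if not wait_for_state(domain, "running", runner=runner):
        raise RuntimeError("guest did not start in time")


def run_cycles(domain, count, runner=subprocess.run):
    """Repeat the shutdown/start cycle 'count' times."""
    for i in range(count):
        cycle(domain, runner=runner)
        print(f"cycle {i + 1}/{count} completed")
```

Calling run_cycles(DOMAIN, 100) on the host repeats the cycle a hundred times. The sketch only checks the domain state. Whether the GPU still works inside the guest (for example with nvidia-smi) has to be checked separately after each start.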
When it fails, I run "nvidia-smi" in the virtual machine and get:
No devices were found
I also spotted some errors in syslog:
NVRM: installed in this system is not supported by the
NVIDIA 460.91.03 driver release.
NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
NVRM: the NVIDIA kernel module is unloaded.
NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
The device is still present, because lspci shows it:
0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation
Device [10de:2204] (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
I tried different NVIDIA drivers and Linux kernels on the host and
in the virtual machine, with the same results.
I wonder if some process is keeping hold of the PCI device, but
everything I tried failed. I made sure the virtual machine was down
before starting it again. I also tried restarting libvirtd.
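One way to check on the host whether something still holds the device is sketched below. It assumes the passthrough uses vfio-pci, which I have not confirmed here, and the host PCI address in the usage note is a placeholder (the 0000:01:01.0 address in the logs is the guest-side one). It reads the driver link in sysfs and scans /proc for open /dev/vfio handles, so it needs to run as root to see other users' processes.

```python
import os


def bound_driver(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the host driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, bdf, "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        return None  # no driver bound, or no such device


def vfio_holders(proc_root="/proc", prefix="/dev/vfio/"):
    """List (pid, path) pairs for processes with an open VFIO device node."""
    holders = set()
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                holders.add((int(pid), target))
    return sorted(holders)
```

With the VM shut down, bound_driver("0000:0a:00.0") (hypothetical host address) shows which host driver currently owns the card, and vfio_holders() should return an empty list if no process is still holding a VFIO handle.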
I was thinking about a hardware problem, but we have 3 different GPUs
of the same model and it happens with all of them.
I know this is hard, but I wonder if anyone here has any idea of
what I can do to fix this.
Thank you for your time.