Hi,
I'm using QEMU/KVM on RHEL (CentOS) 7.8.2003:
# cat /etc/redhat-release
CentOS Linux release 7.8.2003
I'm passing an NVMe drive into a Linux KVM virtual machine
(<type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>), which has
the following 'hostdev' entry:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x42' slot='0x00' function='0x0'/>
  </source>
  <alias name='hostdev5'/>
  <rom bar='off'/>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x0f' function='0x0'/>
</hostdev>
This all works fine during normal operation, but I noticed that when we
remove the NVMe drive (a surprise hotplug event), the PCIe endpoint (EP)
then seems "stuck". Here is the link-down event on the host when the
drive is removed:
[67720.177959] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Down
[67720.178027] vfio-pci 0000:42:00.0: Relaying device request to user (#0)
And naturally, inside of the Linux VM, we see the NVMe controller drop:
[ 1203.491536] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
[ 1203.522759] blk_update_request: I/O error, dev nvme1n2, sector 33554304 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 1203.560505] nvme 0000:01:0f.0: Refused to change power state, currently in D3
[ 1203.561104] nvme nvme1: Removing after probe failure status: -19
[ 1203.583506] Buffer I/O error on dev nvme1n2, logical block 4194288, async page read
[ 1203.583514] blk_update_request: I/O error, dev nvme1n1, sector 33554304 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
This EP is in IOMMU group 76:
# readlink /sys/bus/pci/devices/0000\:42\:00.0/iommu_group
../../../../kernel/iommu_groups/76
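For anyone reproducing this, a minimal sketch for listing every device
in the same IOMMU group together with its currently bound driver. The
function name and the SYSFS_ROOT override are made up for illustration
(the override only exists so the helper can be pointed at a test tree);
the sysfs paths themselves are the standard ones used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): print "<BDF> <driver|none>"
# for each device sharing an IOMMU group with the given PCI device.
# SYSFS_ROOT is an illustrative override and defaults to /sys.
iommu_group_members() {
    bdf=$1
    root=${SYSFS_ROOT:-/sys}
    group_dir=$root/bus/pci/devices/$bdf/iommu_group/devices
    if [ ! -d "$group_dir" ]; then
        echo "no IOMMU group found for $bdf" >&2
        return 1
    fi
    for dev in "$group_dir"/*; do
        [ -e "$dev" ] || continue
        # The 'driver' symlink only exists while a driver is bound.
        if [ -L "$dev/driver" ]; then
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv=none
        fi
        echo "$(basename "$dev") $drv"
    done
}

# Usage (on the host): iommu_group_members 0000:42:00.0
```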
The EP is no longer bound to the 'vfio-pci' driver on the host
(expected). I was expecting all of the FDs for the /dev/vfio/NN
character device to be closed, but they still appear to be open:
# lsof | grep "vfio/76"
qemu-kvm 242364 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242502 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242511 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242518 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242531 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242533 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242542 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242550 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
qemu-kvm 242364 242554 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
SPICE 242364 242559 qemu 70u CHR 235,4 0t0 3925324 /dev/vfio/76
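Note that every row above shows the same PID (242364) and the same FD
(70u); the extra column is the thread ID, so this looks like one open
descriptor reported once per thread. A rough sketch for listing the
holders per process instead, by walking the per-process fd directories.
The function name and the PROC_ROOT override are made up for
illustration (PROC_ROOT defaults to /proc).

```shell
#!/bin/sh
# Hypothetical helper: print "<pid> <fd>" for every process-level file
# descriptor that points at /dev/vfio/<group>. Run as root on the host
# so that other users' fd directories are readable.
# PROC_ROOT is an illustrative override and defaults to /proc.
vfio_group_holders() {
    group=$1
    proc=${PROC_ROOT:-/proc}
    for fd in "$proc"/[0-9]*/fd/*; do
        [ -L "$fd" ] || continue
        if [ "$(readlink "$fd")" = "/dev/vfio/$group" ]; then
            pid=${fd#"$proc"/}   # strip the leading proc root
            pid=${pid%%/*}       # keep only the PID component
            echo "$pid ${fd##*/}"
        fi
    done
}

# Usage (on the host): vfio_group_holders 76
```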
About 100 seconds after the NVMe drive was removed, we see the following
kernel messages on the host:
[67820.179749] vfio-pci 0000:42:00.0: Relaying device request to user (#10)
[67900.272468] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
[67900.272652] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
[67900.319284] vfio_bar_restore: 0000:42:00.0 reset recovery - restoring bars
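The request counter (#N) and the timestamps in these messages can be
used to estimate how often the request is relayed. A small sketch,
assuming dmesg-style lines on stdin; the function name is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and, for each pair
# of consecutive "Relaying device request" messages, print the average
# number of seconds per request between them.
request_interval() {
    awk '
        /Relaying device request to user/ {
            # Timestamp: text between the leading "[" and the first "]".
            ts = $0
            sub(/^\[ */, "", ts)
            sub(/\].*/, "", ts)
            ts = ts + 0
            # Request counter: digits of the last field, e.g. "(#10)".
            n = $NF
            gsub(/[^0-9]/, "", n)
            n = n + 0
            if (seen && n > prev_n)
                printf "#%d -> #%d: %.1f s per request\n", prev_n, n, (ts - prev_ts) / (n - prev_n)
            prev_ts = ts; prev_n = n; seen = 1
        }'
}

# Usage (on the host): dmesg | grep '0000:42:00.0' | request_interval
```

With the (#0) and (#10) lines from this report as input, this prints
"#0 -> #10: 10.0 s per request".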
I also noticed that these messages for the EP that is currently down
seem to continue indefinitely on the host (one every 100 seconds):
[67920.181882] vfio-pci 0000:42:00.0: Relaying device request to user (#20)
[68020.184945] vfio-pci 0000:42:00.0: Relaying device request to user (#30)
[68120.188209] vfio-pci 0000:42:00.0: Relaying device request to user (#40)
[68220.190397] vfio-pci 0000:42:00.0: Relaying device request to user (#50)
[68320.192575] vfio-pci 0000:42:00.0: Relaying device request to user (#60)
But perhaps that is expected behavior. In any case, the problem comes
when I re-insert the NVMe drive into the system... on the host, we see
the link-up event:
[68418.595101] pciehp 0000:40:01.2:pcie004: Slot(238-1): Link Up
But the device is not bound to the 'vfio-pci' driver:
# ls -ltr /sys/bus/pci/devices/0000\:42\:00.0/driver
ls: cannot access /sys/bus/pci/devices/0000:42:00.0/driver: No such file or directory
Binding it manually also appears to fail:
# echo "0000:42:00.0" > /sys/bus/pci/drivers/vfio-pci/bind
-bash: echo: write error: No such device
The device is enabled:
# cat /sys/bus/pci/devices/0000\:42\:00.0/enable
1
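The individual checks above can be collected into one line of output,
which is handy for comparing the state before removal, after removal,
and after re-insertion. A minimal sketch; the function name and the
SYSFS_ROOT override are made up for illustration, and driver_override
is reported as "n/a" if the running kernel does not expose it.

```shell
#!/bin/sh
# Hypothetical helper: summarize the host-side sysfs state of one PCI
# device as "<BDF> driver=<name|none> enable=<v> driver_override=<v>".
# SYSFS_ROOT is an illustrative override and defaults to /sys.
pci_dev_state() {
    bdf=$1
    dev=${SYSFS_ROOT:-/sys}/bus/pci/devices/$bdf
    if [ ! -d "$dev" ]; then
        echo "$bdf not present"
        return 1
    fi
    # The 'driver' symlink only exists while a driver is bound.
    if [ -L "$dev/driver" ]; then
        drv=$(basename "$(readlink "$dev/driver")")
    else
        drv=none
    fi
    line="$bdf driver=$drv"
    for attr in enable driver_override; do
        if [ -r "$dev/$attr" ]; then
            val=$(cat "$dev/$attr")
        else
            val=n/a
        fi
        line="$line $attr=$val"
    done
    echo "$line"
}

# Usage (on the host): pci_dev_state 0000:42:00.0
```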
So I'm wondering whether this is expected behavior. Stopping the VM and
starting it again (virsh destroy/start) allows the device to work in the
VM again, but for my particular use case that is not an option: I need
surprise hotplug to work with the PCIe EP passed into the VM. Perhaps
this is an issue elsewhere (e.g., vfio-pci). Any tips/suggestions on
where to dig further would be appreciated.
Thanks for your time.
--Marc