[PATCH v2] util: basic support for VFIO variant drivers

Before a PCI device can be assigned to a guest with VFIO, that device must be bound to the vfio-pci driver rather than to the device's normal driver. The vfio-pci driver provides APIs that permit QEMU to perform all the necessary operations to make the device accessible to the guest.

There has been kernel work recently to support vendor/device-specific VFIO variant drivers that provide the basic vfio-pci driver functionality while adding support for device-specific operations (for example, these device-specific drivers are planned to support live migration of certain devices). All that will be needed to make this functionality available is to bind the new vendor-specific driver to the device (rather than the generic vfio-pci driver, which will continue to work, just without the extra functionality).

But until now libvirt has required that all PCI devices being assigned to a guest with VFIO have specifically the "vfio-pci" driver bound to the device. So even if the user manually binds a shiny new vendor-specific vfio variant driver to the device (and puts "managed='no'" in the config to prevent libvirt from changing the binding), libvirt will fail during startup of the guest (or during hotplug) because the driver bound to the device isn't exactly "vfio-pci".

This patch loosens that restriction a bit - rather than requiring that the device be bound to "vfio-pci", it also checks whether the driver name contains the string "vfio" at all, and in that case allows the operation to continue. If the driver is in fact a VFIO variant, the assignment will succeed; if it is not a VFIO variant, QEMU will fail (and report the error back to libvirt).

In the near future (possibly by kernel 6.0) there will be a formal method of identifying a VFIO variant driver by looking in sysfs; in the meantime the inexact, but simple, method in this patch will allow users of the few existing VFIO variant drivers (and developers of new VFIO variant drivers) to use their new drivers without needing to remove libvirt from their setup - they can simply pre-bind the device to the new driver, then use "managed='no'" in their libvirt config.

NB: this patch does *not* handle automatically determining the proper vendor-specific driver and binding to it in the case of "managed='yes'". That will be implemented later, when there is a widely available driver/device combo we can use for testing.

Signed-off-by: Laine Stump <laine@redhat.com>
---
V1 here: https://listman.redhat.com/archives/libvir-list/2022-August/233327.html

Change in V2:

V1 used the output of modinfo to look for "vfio_pci" as an alias for a driver to see if it was a VFIO variant driver.

As a result of discussion of V1, V2 is much simpler - it just assumes that any driver with "vfio" in the name is a VFIO variant. This is okay because 1) QEMU will still catch it, and libvirt will properly log the error, if the driver isn't actually a VFIO variant, and 2) it's a temporary situation, just to enable use of VFIO variant drivers with libvirt until a standard method of detecting this is added to sysfs (which, according to the discussion of V1, is coming in the near future).

(NB: I did implement checking of /lib/modules/`uname -r`/modules.alias as suggested by Erik, but it turned out that this caused the unit tests to call uname(3) and open the modules.alias file on the test host - for a proper unit test I would have also needed to mock those two functions, and it seemed like too much complexity for a temporary workaround. I've implemented Jason's suggestion here (accept any driver with "vfio" in the name), which is similar to danpb's suggestion (accept specifically the two drivers that are already in the upstream kernel), but will also allow for new drivers that may be under development.)

 src/hypervisor/virhostdev.c | 26 ++++---------
 src/util/virpci.c           | 76 ++++++++++++++++++++++++++++++++++---
 src/util/virpci.h           |  3 ++
 3 files changed, 82 insertions(+), 23 deletions(-)

diff --git a/src/hypervisor/virhostdev.c b/src/hypervisor/virhostdev.c
index c0ce867596..15b35fa75e 100644
--- a/src/hypervisor/virhostdev.c
+++ b/src/hypervisor/virhostdev.c
@@ -747,9 +747,8 @@ virHostdevPreparePCIDevicesImpl(virHostdevManager *mgr,
                                              mgr->inactivePCIHostdevs) < 0)
                 goto reattachdevs;
         } else {
-            g_autofree char *driverPath = NULL;
-            g_autofree char *driverName = NULL;
-            int stub;
+            g_autofree char *drvName = NULL;
+            virPCIStubDriver drvType;
 
             /* Unmanaged devices should already have been marked as
              * inactive: if that's the case, we can simply move on */
@@ -769,18 +768,14 @@ virHostdevPreparePCIDevicesImpl(virHostdevManager *mgr,
              * information about active / inactive device across
              * daemon restarts has been implemented */
 
-            if (virPCIDeviceGetDriverPathAndName(pci,
-                                                 &driverPath, &driverName) < 0)
+            if (virPCIDeviceGetDriverNameAndType(pci, &drvName, &drvType) < 0)
                 goto reattachdevs;
 
-            stub = virPCIStubDriverTypeFromString(driverName);
-
-            if (stub > VIR_PCI_STUB_DRIVER_NONE &&
-                stub < VIR_PCI_STUB_DRIVER_LAST) {
+            if (drvType > VIR_PCI_STUB_DRIVER_NONE) {
 
                 /* The device is bound to a known stub driver: store this
                  * information and add a copy to the inactive list */
-                virPCIDeviceSetStubDriver(pci, stub);
+                virPCIDeviceSetStubDriver(pci, drvType);
 
                 VIR_DEBUG("Adding PCI device %s to inactive list",
                           virPCIDeviceGetName(pci));
@@ -2292,18 +2287,13 @@ virHostdevPrepareOneNVMeDevice(virHostdevManager *hostdev_mgr,
     /* Let's check if all PCI devices are NVMe disks. */
     for (i = 0; i < virPCIDeviceListCount(pciDevices); i++) {
         virPCIDevice *pci = virPCIDeviceListGet(pciDevices, i);
-        g_autofree char *drvPath = NULL;
         g_autofree char *drvName = NULL;
-        int stub = VIR_PCI_STUB_DRIVER_NONE;
+        virPCIStubDriver drvType;
 
-        if (virPCIDeviceGetDriverPathAndName(pci, &drvPath, &drvName) < 0)
+        if (virPCIDeviceGetDriverNameAndType(pci, &drvName, &drvType) < 0)
             goto cleanup;
 
-        if (drvName)
-            stub = virPCIStubDriverTypeFromString(drvName);
-
-        if (stub == VIR_PCI_STUB_DRIVER_VFIO ||
-            STREQ_NULLABLE(drvName, "nvme"))
+        if (drvType == VIR_PCI_STUB_DRIVER_VFIO || STREQ_NULLABLE(drvName, "nvme"))
             continue;
 
         VIR_WARN("Suspicious NVMe disk assignment. PCI device "
diff --git a/src/util/virpci.c b/src/util/virpci.c
index 7800966963..51ccf4d9fd 100644
--- a/src/util/virpci.c
+++ b/src/util/virpci.c
@@ -277,6 +277,71 @@ virPCIDeviceGetDriverPathAndName(virPCIDevice *dev, char **path, char **name)
 }
 
 
+/**
+ * virPCIDeviceGetDriverNameAndType:
+ * @dev: virPCIDevice object to examine
+ * @drvName: returns name of driver bound to this device (if any)
+ * @drvType: returns type of driver if it is a known stub driver type
+ *
+ * Find the name of the driver bound to @dev (if any) and the type of
+ * the driver if it is a known/recognized "stub" driver (based on the
+ * driver name).
+ *
+ * There are vfio "variant" drivers that provide all the basic
+ * functionality of the standard vfio-pci driver as well as additional
+ * stuff. There is a plan to add info to sysfs that will allow easily
+ * determining if a driver is a vfio variant driver, but that sysfs
+ * entry isn't yet available. In the meantime, as a workaround so that
+ * the few existing vfio variant drivers can be used with libvirt, and
+ * so that driver developers can test their new vfio variant drivers
+ * without needing to bypass libvirt, we also check if the driver name
+ * contains the string "vfio"; if it does, then we consider this driver
+ * as type VFIO. This can lead to false positives, but that isn't a
+ * horrible thing, because the problem will still be caught by QEMU as
+ * soon as libvirt makes the request to attach the device.
+ *
+ * Return 0 on success, -1 on failure. If -1 is returned, then an error
+ * message has been logged.
+ */
+int
+virPCIDeviceGetDriverNameAndType(virPCIDevice *dev,
+                                 char **drvName,
+                                 virPCIStubDriver *drvType)
+{
+    g_autofree char *drvPath = NULL;
+    int tmpType;
+
+    if (virPCIDeviceGetDriverPathAndName(dev, &drvPath, drvName) < 0)
+        return -1;
+
+    if (!*drvName) {
+        *drvType = VIR_PCI_STUB_DRIVER_NONE;
+        return 0;
+    }
+
+    tmpType = virPCIStubDriverTypeFromString(*drvName);
+
+    if (tmpType > VIR_PCI_STUB_DRIVER_NONE) {
+        *drvType = tmpType;
+        return 0; /* exact match of a known driver name (or no name) */
+    }
+
+    /* Check if the drivername contains "vfio" and count as a VFIO
+     * driver if so - see above for explanation.
+     */
+    if (strstr(*drvName, "vfio")) {
+        VIR_DEBUG("Driver %s is a vfio_pci driver", *drvName);
+        *drvType = VIR_PCI_STUB_DRIVER_VFIO;
+    } else {
+        VIR_DEBUG("Driver %s is NOT a vfio_pci driver", *drvName);
+        *drvType = VIR_PCI_STUB_DRIVER_NONE;
+    }
+
+    return 0;
+}
+
+
 static int
 virPCIDeviceConfigOpenInternal(virPCIDevice *dev, bool readonly, bool fatal)
 {
@@ -1004,8 +1069,8 @@ virPCIDeviceReset(virPCIDevice *dev,
                   virPCIDeviceList *activeDevs,
                   virPCIDeviceList *inactiveDevs)
 {
-    g_autofree char *drvPath = NULL;
     g_autofree char *drvName = NULL;
+    virPCIStubDriver drvType;
     int ret = -1;
     int fd = -1;
     int hdrType = -1;
@@ -1032,15 +1097,16 @@ virPCIDeviceReset(virPCIDevice *dev,
      * reset it whenever appropriate, so doing it ourselves would just
      * be redundant.
      */
-    if (virPCIDeviceGetDriverPathAndName(dev, &drvPath, &drvName) < 0)
+    if (virPCIDeviceGetDriverNameAndType(dev, &drvName, &drvType) < 0)
         goto cleanup;
 
-    if (virPCIStubDriverTypeFromString(drvName) == VIR_PCI_STUB_DRIVER_VFIO) {
-        VIR_DEBUG("Device %s is bound to vfio-pci - skip reset",
-                  dev->name);
+    if (drvType == VIR_PCI_STUB_DRIVER_VFIO) {
+
+        VIR_DEBUG("Device %s is bound to %s - skip reset", dev->name, drvName);
         ret = 0;
         goto cleanup;
     }
+
     VIR_DEBUG("Resetting device %s", dev->name);
 
     if ((fd = virPCIDeviceConfigOpenWrite(dev)) < 0)
diff --git a/src/util/virpci.h b/src/util/virpci.h
index 4d9193f24e..0532b90f90 100644
--- a/src/util/virpci.h
+++ b/src/util/virpci.h
@@ -280,6 +280,9 @@ int virPCIDeviceRebind(virPCIDevice *dev);
 int virPCIDeviceGetDriverPathAndName(virPCIDevice *dev,
                                      char **path,
                                      char **name);
+int virPCIDeviceGetDriverNameAndType(virPCIDevice *dev,
+                                     char **drvName,
+                                     virPCIStubDriver *drvType);
 int virPCIDeviceIsPCIExpress(virPCIDevice *dev);
 int virPCIDeviceHasPCIExpressLink(virPCIDevice *dev);
-- 
2.37.1
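For illustration, here is roughly what the "pre-bind the driver, then managed='no'" workflow from the commit message looks like from the user's side. The variant driver name "vfio-pci-vendorXYZ" and the PCI address are hypothetical placeholders, not anything from this patch:

echo 0000:06:00.0 > /sys/bus/pci/devices/0000:06:00.0/driver/unbind
echo vfio-pci-vendorXYZ > /sys/bus/pci/devices/0000:06:00.0/driver_override
echo 0000:06:00.0 > /sys/bus/pci/drivers_probe

and then the device is referenced in the domain XML with managed='no' so that libvirt leaves the existing binding alone:

<hostdev mode='subsystem' type='pci' managed='no'>
  <source>
    <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
  </source>
</hostdev>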

ping.

I have a different version of this patch where I do read the modules.alias file rather than just checking the name of the driver, but that also requires "double mocking" open() in the unit test, which wasn't working properly, and I'd rather not spend the time figuring it out if it's not going to be needed. (Alex prefers that version because it is more correct than just checking the name, and he's concerned that the new sysfs-based API may take longer than we're thinking to get into downstream distros, but the version in this patch does satisfy both Jason's and Daniel's suggested implementations.) Anyway, I can post the other patch if anyone is interested.

On 8/23/22 16:11, Laine Stump wrote:
ping.
If you want I can help you with the mocking issue. But for this patch - I don't have a strong opinion either way, sorry. Maybe I lean more towards this solution, because it's simpler.

Michal

On Tue, 23 Aug 2022 10:11:32 -0400 Laine Stump <laine@redhat.com> wrote:
ping.
Yeah, I'm still not a fan of this approach. We're essentially inventing a requirement in libvirt for a kernel driver naming convention, because it happens to work. For now. Hacky temporary solutions have been known to be longer lived than anticipated. This eventually deteriorates into managing a list of drivers that don't meet the convention, frustrating developers unaware of this arbitrary requirement and/or delaying usability through libvirt. Thanks,

Alex

On Tue, Aug 23, 2022 at 01:51:33PM -0600, Alex Williamson wrote:
Kevin & co have some patches already to do the struct device sysfs; I hope he can post them, they looked quite close last I saw. Let's give him a few weeks - having that in hand removes the worry about endless hackiness into the future.

Jason

Hey Laine,

On 23/08/2022 15:11, Laine Stump wrote:
ping.
[sorry for the thread necromancy]

I was wondering if you're planning on respinning this work, or rather the alternative modalias approach you mention? Or perhaps we are waiting for the cdev sysfs? Though, there's still a significant number of kernel versions to cover that won't have the sysfs entry, sadly :(

On Wed, May 31, 2023 at 03:18:17PM +0100, Joao Martins wrote:
The cdev sysfs has been merged to the kernel now

Jason

On 31/05/2023 15:31, Jason Gunthorpe wrote:
The cdev sysfs has been merged to the kernel now
Ugh, I got confused with something else entirely. Yes, it's there since commit 3c28a76124b258 ("vfio: Add struct device to vfio_device"), v6.1 essentially.

On 5/31/23 10:31 AM, Jason Gunthorpe wrote:
[sorry for the thread necromancy]
Heh. I had actually dug out this same thread last week and started a mail to ask Jason if the planned sysfs stuff had ever been pushed, but then forgot to hit "send" :-)

Now that there are multiple vfio variant drivers available (for igb, e1000e, and mlx5 that I know of), it's getting more important to have them usable with libvirt, so I'm hoping to update this patch to use sysfs to determine whether the driver is a vfio variant.
The cdev sysfs has been merged to the kernel now
Jason, can you point me at the information for this patch in an ELI5 manner for a non-kernel person? (including what upstream kernel it's in, and what it is that I need to look at to determine if a driver is a vfio variant). Thanks!

On Wed, May 31, 2023 at 01:18:44PM -0400, Laine Stump wrote:
Now that there are multiple vfio variant drivers available (for igb, e1000e, and mlx5 that I know of),
Oh I haven't seen those intel ones posted yet?
Jason, can you point me at the information for this patch in an ELI5 manner for a non-kernel person? (including what upstream kernel it's in, and what it is that I need to look at to determine if a driver is a vfio variant).
It is this patch:

commit 3c28a76124b25882411f005924be73795b6ef078
Author: Yi Liu <yi.l.liu@intel.com>
Date:   Wed Sep 21 18:44:01 2022 +0800

    vfio: Add struct device to vfio_device and replace kref.

    With it a 'vfio-dev/vfioX' node is created under the sysfs path of
    the parent, indicating the device is bound to a vfio driver, e.g.:

    /sys/devices/pci0000\:6f/0000\:6f\:01.0/vfio-dev/vfio0

    It is also a preparatory step toward adding cdev for supporting
    future device-oriented uAPI.

    Add Documentation/ABI/testing/sysfs-devices-vfio-dev.

    Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Yi Liu <yi.l.liu@intel.com>
    Signed-off-by: Kevin Tian <kevin.tian@intel.com>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
    Link: https://lore.kernel.org/r/20220921104401.38898-16-kevin.tian@intel.com
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

$ git describe --contains 3c28a76124b25882411f005924be73795b6ef078
v6.1-rc1~42^2~35

So it is in v6.1-rc1.

libvirt should start thinking about determining the vfioX number for the device; we will need that for iommufd enablement eventually.

So, stat for a directory like this:

/sys/devices/pci0000\:6f/0000\:6f\:01.0/vfio-dev

Confirms vfio. Then scan it to get 'vfioX', which will eventually be the /dev/ node libvirt will have to open.

And the other part is that something in the stack should use the modalias mechanism to find, load, and bind the correct variant driver.

Jason
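As a rough C sketch of the check described above (illustrative only - not libvirt code; the function name and buffer sizes are made up, and error reporting is elided):

#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Return 1 and fill in @vfioName (e.g. "vfio0") if the device at
 * @sysfsPath (e.g. "/sys/bus/pci/devices/0000:6f:01.0") has a
 * vfio-dev/ subdirectory, i.e. is bound to a vfio driver; else 0. */
static int
deviceIsBoundToVFIO(const char *sysfsPath, char *vfioName, size_t len)
{
    char dirPath[1024];
    struct stat sb;
    DIR *dir;
    struct dirent *ent;
    int found = 0;

    snprintf(dirPath, sizeof(dirPath), "%s/vfio-dev", sysfsPath);
    if (stat(dirPath, &sb) < 0 || !S_ISDIR(sb.st_mode))
        return 0; /* no vfio-dev directory - not bound to a vfio driver */

    if (!(dir = opendir(dirPath)))
        return 0;
    while ((ent = readdir(dir)) != NULL) {
        if (strncmp(ent->d_name, "vfio", 4) == 0) {
            snprintf(vfioName, len, "%s", ent->d_name); /* the vfioX node */
            found = 1;
            break;
        }
    }
    closedir(dir);
    return found;
}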

On Wed, 31 May 2023 14:30:52 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
I'd forgotten about this as well, so after binding a driver to a device we can tell if that driver presents a vfio interface by looking for this sysfs directory. Prior to binding to a device, we can only know if a driver provides a vfio interface through modalias.

Also note that we're saying "vfio" not "vfio-pci". Only the mdev interface has the device_api attribute to indicate the exported vfio device interface. The "vfio_pci:" match in modalias indicates a vfio PCI driver, not necessarily a driver that provides a vfio-pci API. We have no current examples to the contrary, but for instance I wouldn't recommend validating whether mode='subsystem' type='pci' is appropriate based on that information. Thanks,

Alex

On Wed, May 31, 2023 at 02:40:01PM -0600, Alex Williamson wrote:
modalias was designed so you take the /sys/.../modalias file, prepend vfio_, then do a standard modalias search on that string. The matching module should be loaded, and the module name bound to the device as the driver name. There should be no bus-type dependencies in any of this in management code.

Jason
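A short C sketch of that mechanism (illustrative only; the actual search of modules.alias, which picks the match with the fewest wildcards, is elided):

#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Read the device's modalias (e.g. from
 * /sys/bus/pci/devices/DDDD:BB:SS.F/modalias) and build the
 * "vfio_"-prefixed string to look up in
 * /lib/modules/$(uname -r)/modules.alias. */
static int
buildVFIOModalias(const char *sysfsDevPath, char *alias, size_t aliaslen)
{
    char path[PATH_MAX];
    char modalias[256];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/modalias", sysfsDevPath);
    if (!(fp = fopen(path, "r")))
        return -1;
    if (!fgets(modalias, sizeof(modalias), fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    modalias[strcspn(modalias, "\n")] = '\0';

    /* e.g. "pci:v00008086d00002723..." becomes
     * "vfio_pci:v00008086d00002723..." */
    snprintf(alias, aliaslen, "vfio_%s", modalias);
    return 0;
}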

On Wed, 31 May 2023 17:46:50 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
For example, the modalias of a random wifi adapter:

pci:v00008086d00002723sv00008086sd00000084bc02sc80i00

The bus name is prepended because the encoding is bus specific. Yes, code doesn't really need to interpret that; it simply adds "vfio_" to the beginning of the string and finds the matching driver with the fewest wildcards in modules.alias. We are not code, though - we have a vfio_pci driver, a vfio-pci device API, and a vfio_pci: modalias prefix, and it's easy to get them confused and infer information that isn't intended. All I'm trying (poorly) to clarify is that a vfio_pci: modalias prefix only indicates a vfio driver for a PCI device. It does not guarantee that the vfio device API exposed to userspace is vfio-pci. Therefore management tools should be cautious about making assumptions regarding the type of device the VM will see, even though we've got vfio-pci written all over the place. Thanks,

Alex

On Wed, May 31, 2023 at 03:34:11PM -0600, Alex Williamson wrote:
Yes. The vfio-pci API should be learned in-band with some ioctl on an opened device? Jason
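For reference, the existing vfio uAPI already has something along these lines: VFIO_DEVICE_GET_INFO sets VFIO_DEVICE_FLAGS_PCI for devices exposing the vfio-pci API. A sketch, assuming the vfio device fd has already been opened:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Return 1 if the opened vfio device exposes the vfio-pci uAPI,
 * 0 if not, -1 on ioctl failure. */
static int
deviceExposesVFIOPCI(int device_fd)
{
    struct vfio_device_info info = { .argsz = sizeof(info) };

    if (ioctl(device_fd, VFIO_DEVICE_GET_INFO, &info) < 0)
        return -1;

    return !!(info.flags & VFIO_DEVICE_FLAGS_PCI);
}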
participants (5): Alex Williamson, Jason Gunthorpe, Joao Martins, Laine Stump, Michal Prívozník