[libvirt] [PATCH 00/10] VFIO fixes for PCI devices

This series is my attempt at fixing https://bugzilla.redhat.com/show_bug.cgi?id=1272300 In its current state, it's missing test cases covering the new functionality[1] and it's known not to handle properly one situation[2], but I'd like to get some feedback on my current work and now that I have something to show for it feels like a good time. I'm already working on the missing bits and they will either be included in the next revision or sent as separate series later on. The problem being solved is that, when using VFIO, IOMMU group ownership can't be shared, eg. two devices that are in the same IOMMU group can't be assigned to different guests, or to the host and a guest. If that happens, the host will probably crash. The series deals with this issue by making sure safety conditions are met before detaching devices from the host or reattaching them to the host. In praticular, when we're asked to reattach a device to the host but doing so would lead to sharing IOMMU group ownership, we delay the operation until we can guarantee this will not cause problems. As a nice side effect of the changes we check for this when starting a guest too, instead of assuming it will work and having QEMU error out immediately afterwards. Patches are organized as follows: 1-2: Minor cleanups that make implicit / confusing stuff explicit / less confusing 3: Convert a string field used as an enumeration to a proper enumeration 4: Introduce a simple helper function used later on 5-6: Rewrite the checks used when detaching devices from the host. With this patches applied, the behaviour is basically the same as before, except for the nice little extra detailed above 7-9: Implement the delay when reattaching devices to the host, thus preventing the crash and fixing the bug 10: Spit and polish Cheers. [1] Luckily, it doesn't break the existing tests either [2] If you call 'virsh nodedev-reattach' on a device that's assigned to a guest, libvirt won't stop you and you will end up crashing your system Andrea Bolognani (10): pci: Remove redundant parameter from virPCIDeviceBindToStub() pci: Remove 'reprobe' parameter from virPCIDeviceUnbind() pci: Introduce virPCIStubDriver enumeration pci: Introduce virPCIDeviceIOMMUGroupIterate() hostdev: Simplify virHostdevIsPCIDeviceUsed() hostdev: Check for safety before detaching VFIO devices hostdev: Delay reattach of VFIO devices hostdev: Clean up delayed VFIO devices hostdev: Devices have already been marked as inactive hostdev: Tidy up after changes to VFIO device handling src/libvirt_private.syms | 3 + src/libxl/libxl_driver.c | 3 +- src/qemu/qemu_driver.c | 6 +- src/util/virhostdev.c | 410 ++++++++++++++++++++++++++++++++++------------- src/util/virpci.c | 125 +++++++++------ src/util/virpci.h | 26 ++- src/xen/xen_driver.c | 3 +- tests/virhostdevtest.c | 5 +- tests/virpcitest.c | 35 ++-- 9 files changed, 427 insertions(+), 189 deletions(-) -- 2.5.0

This internal function supports, in theory, binding to a different stub driver than the one the PCI device has been configured to use. In practice, it is only ever called like virPCIDeviceBindToStub(dev, dev->stubDriver); which makes its second parameter redundant. Get rid of it, along with the extra string copy required to support it. --- src/util/virpci.c | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-) diff --git a/src/util/virpci.c b/src/util/virpci.c index bececb5..f57070d 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1176,20 +1176,18 @@ virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) static int -virPCIDeviceBindToStub(virPCIDevicePtr dev, - const char *stubDriverName) +virPCIDeviceBindToStub(virPCIDevicePtr dev) { int result = -1; bool reprobe = false; char *stubDriverPath = NULL; char *driverLink = NULL; char *path = NULL; /* reused for different purposes */ - char *newDriverName = NULL; + char *stubDriverName = dev->stubDriver; virErrorPtr err = NULL; if (!(stubDriverPath = virPCIDriverDir(stubDriverName)) || - !(driverLink = virPCIFile(dev->name, "driver")) || - VIR_STRDUP(newDriverName, stubDriverName) < 0) + !(driverLink = virPCIFile(dev->name, "driver"))) goto cleanup; if (virFileExists(driverLink)) { @@ -1304,13 +1302,8 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, VIR_FREE(driverLink); VIR_FREE(path); - if (result < 0) { - VIR_FREE(newDriverName); + if (result < 0) virPCIDeviceUnbindFromStub(dev); - } else { - VIR_FREE(dev->stubDriver); - dev->stubDriver = newDriverName; - } if (err) virSetError(err); @@ -1353,7 +1346,7 @@ virPCIDeviceDetach(virPCIDevicePtr dev, return -1; } - if (virPCIDeviceBindToStub(dev, dev->stubDriver) < 0) + if (virPCIDeviceBindToStub(dev) < 0) return -1; /* Add *a copy of* the dev into list inactiveDevs, if -- 2.5.0

The value is not inspected inside the function, so it makes more sense for the caller to change the device's setting explicitly. --- src/util/virpci.c | 10 ++++++---- src/util/virpci.h | 2 +- tests/virpcitest.c | 2 +- 3 files changed, 8 insertions(+), 6 deletions(-) diff --git a/src/util/virpci.c b/src/util/virpci.c index f57070d..8094735 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1036,7 +1036,7 @@ virPCIProbeStubDriver(const char *driver) } int -virPCIDeviceUnbind(virPCIDevicePtr dev, bool reprobe) +virPCIDeviceUnbind(virPCIDevicePtr dev) { char *path = NULL; char *drvpath = NULL; @@ -1062,7 +1062,6 @@ virPCIDeviceUnbind(virPCIDevicePtr dev, bool reprobe) dev->name, driver); goto cleanup; } - dev->reprobe = reprobe; } ret = 0; @@ -1115,7 +1114,7 @@ virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) if (!isStub) goto remove_slot; - if (virPCIDeviceUnbind(dev, dev->reprobe) < 0) + if (virPCIDeviceUnbind(dev) < 0) goto cleanup; dev->unbind_from_stub = false; @@ -1229,9 +1228,12 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev) goto remove_id; } - if (virPCIDeviceUnbind(dev, reprobe) < 0) + if (virPCIDeviceUnbind(dev) < 0) goto remove_id; + /* If the device was bound to a driver we'll need to reprobe later */ + dev->reprobe = reprobe; + /* If the device isn't already bound to pci-stub, try binding it now. */ if (!virFileLinkPointsTo(driverLink, stubDriverPath)) { diff --git a/src/util/virpci.h b/src/util/virpci.h index 6516f05..d9911d0 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -201,7 +201,7 @@ int virPCIDeviceAddressParse(char *address, virPCIDeviceAddressPtr bdf); int virPCIGetVirtualFunctionInfo(const char *vf_sysfs_device_path, char **pfname, int *vf_index); -int virPCIDeviceUnbind(virPCIDevicePtr dev, bool reprobe); +int virPCIDeviceUnbind(virPCIDevicePtr dev); int virPCIDeviceGetDriverPathAndName(virPCIDevicePtr dev, char **path, char **name); diff --git a/tests/virpcitest.c b/tests/virpcitest.c index d4d3253..35b4781 100644 --- a/tests/virpcitest.c +++ b/tests/virpcitest.c @@ -341,7 +341,7 @@ testVirPCIDeviceUnbind(const void *opaque) if (!dev) goto cleanup; - if (virPCIDeviceUnbind(dev, false) < 0) + if (virPCIDeviceUnbind(dev) < 0) goto cleanup; ret = 0; -- 2.5.0

This replaces the virPCIKnownStubs string array that was used internally for stub driver validation. Advantages: * possible values are well-defined * typos in driver names will be detected at compile time * avoids having several copies of the same string around * no error checking required when setting / getting value The names used mirror those in the virDomainHostdevSubsysPCIBackendType enumeration. --- src/libvirt_private.syms | 2 ++ src/libxl/libxl_driver.c | 3 +- src/qemu/qemu_driver.c | 6 ++-- src/util/virhostdev.c | 43 +++++++++------------------- src/util/virpci.c | 74 ++++++++++++++++++++++++++++-------------------- src/util/virpci.h | 16 ++++++++--- src/xen/xen_driver.c | 3 +- tests/virhostdevtest.c | 5 ++-- tests/virpcitest.c | 33 ++++++++++++++------- 9 files changed, 100 insertions(+), 85 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index dd085c3..c1fd9f6 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1999,6 +1999,8 @@ virPCIGetVirtualFunctionIndex; virPCIGetVirtualFunctionInfo; virPCIGetVirtualFunctions; virPCIIsVirtualFunction; +virPCIStubDriverTypeFromString; +virPCIStubDriverTypeToString; # util/virpidfile.h diff --git a/src/libxl/libxl_driver.c b/src/libxl/libxl_driver.c index 35d7fae..c28b884 100644 --- a/src/libxl/libxl_driver.c +++ b/src/libxl/libxl_driver.c @@ -4972,8 +4972,7 @@ libxlNodeDeviceDetachFlags(virNodeDevicePtr dev, goto cleanup; if (!driverName || STREQ(driverName, "xen")) { - if (virPCIDeviceSetStubDriver(pci, "pciback") < 0) - goto cleanup; + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_XEN); } else { virReportError(VIR_ERR_INVALID_ARG, _("unsupported driver name '%s'"), driverName); diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index ae1d8e7..91fb1c4 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12860,8 +12860,7 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, "supported on this system")); goto cleanup; } - if (virPCIDeviceSetStubDriver(pci, "vfio-pci") < 0) - goto cleanup; + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_VFIO); } else if (STREQ(driverName, "kvm")) { if (!legacy) { virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", @@ -12869,8 +12868,7 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, "supported on this system")); goto cleanup; } - if (virPCIDeviceSetStubDriver(pci, "pci-stub") < 0) - goto cleanup; + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_KVM); } else { virReportError(VIR_ERR_INVALID_ARG, _("unknown driver name '%s'"), driverName); diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index de029a0..33a45cb 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -237,22 +237,12 @@ virHostdevGetPCIHostDeviceList(virDomainHostdevDefPtr *hostdevs, int nhostdevs) } virPCIDeviceSetManaged(dev, hostdev->managed); - if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - if (virPCIDeviceSetStubDriver(dev, "vfio-pci") < 0) { - virObjectUnref(list); - return NULL; - } - } else if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_XEN) { - if (virPCIDeviceSetStubDriver(dev, "pciback") < 0) { - virObjectUnref(list); - return NULL; - } - } else { - if (virPCIDeviceSetStubDriver(dev, "pci-stub") < 0) { - virObjectUnref(list); - return NULL; - } - } + if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_VFIO); + else if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_XEN) + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_XEN); + else + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_KVM); } return list; @@ -574,7 +564,7 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); bool strict_acs_check = !!(flags & VIR_HOSTDEV_STRICT_ACS_CHECK); - bool usesVfio = STREQ(virPCIDeviceGetStubDriver(dev), "vfio-pci"); + bool usesVfio = (virPCIDeviceGetStubDriver(dev) == VIR_PCI_STUB_DRIVER_VFIO); struct virHostdevIsPCINodeDeviceUsedData data = {hostdev_mgr, dom_name, usesVfio}; @@ -749,7 +739,7 @@ virHostdevReattachPCIDevice(virPCIDevicePtr dev, virHostdevManagerPtr mgr) } /* Wait for device cleanup if it is qemu/kvm */ - if (STREQ(virPCIDeviceGetStubDriver(dev), "pci-stub")) { + if (virPCIDeviceGetStubDriver(dev) == VIR_PCI_STUB_DRIVER_KVM) { int retries = 100; while (virPCIDeviceWaitForCleanup(dev, "kvm_assigned_device") && retries) { @@ -917,17 +907,12 @@ virHostdevUpdateActivePCIDevices(virHostdevManagerPtr mgr, goto cleanup; virPCIDeviceSetManaged(dev, hostdev->managed); - if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - if (virPCIDeviceSetStubDriver(dev, "vfio-pci") < 0) - goto cleanup; - } else if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_XEN) { - if (virPCIDeviceSetStubDriver(dev, "pciback") < 0) - goto cleanup; - } else { - if (virPCIDeviceSetStubDriver(dev, "pci-stub") < 0) - goto cleanup; - - } + if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_VFIO); + else if (pcisrc->backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_XEN) + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_XEN); + else + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_KVM); virPCIDeviceSetUsedBy(dev, drv_name, dom_name); /* Setup the original states for the PCI device */ diff --git a/src/util/virpci.c b/src/util/virpci.c index 8094735..e82583a 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -55,6 +55,12 @@ VIR_LOG_INIT("util.pci"); VIR_ENUM_IMPL(virPCIELinkSpeed, VIR_PCIE_LINK_SPEED_LAST, "", "2.5", "5", "8") +VIR_ENUM_IMPL(virPCIStubDriver, VIR_PCI_STUB_DRIVER_LAST, + "pciback", /* XEN */ + "pci-stub", /* KVM */ + "vfio-pci", /* VFIO */ +); + struct _virPCIDevice { unsigned int domain; unsigned int bus; @@ -74,7 +80,8 @@ struct _virPCIDevice { bool has_flr; bool has_pm_reset; bool managed; - char *stubDriver; + + virPCIStubDriver stubDriver; /* used by reattach function */ bool unbind_from_stub; @@ -940,7 +947,7 @@ virPCIDeviceReset(virPCIDevicePtr dev, if (virPCIDeviceGetDriverPathAndName(dev, &drvPath, &drvName) < 0) goto cleanup; - if (STREQ_NULLABLE(drvName, "vfio-pci")) { + if (virPCIStubDriverTypeFromString(drvName) == VIR_PCI_STUB_DRIVER_VFIO) { VIR_DEBUG("Device %s is bound to vfio-pci - skip reset", dev->name); ret = 0; @@ -991,13 +998,21 @@ virPCIDeviceReset(virPCIDevicePtr dev, static int -virPCIProbeStubDriver(const char *driver) +virPCIProbeStubDriver(virPCIStubDriver driver) { + const char *drvname = NULL; char *drvpath = NULL; bool probed = false; + if (!(drvname = virPCIStubDriverTypeToString(driver))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + "%s", + _("Attempting to use unknown stub driver")); + return -1; + } + recheck: - if ((drvpath = virPCIDriverDir(driver)) && virFileExists(drvpath)) { + if ((drvpath = virPCIDriverDir(drvname)) && virFileExists(drvpath)) { /* driver already loaded, return */ VIR_FREE(drvpath); return 0; @@ -1008,8 +1023,8 @@ virPCIProbeStubDriver(const char *driver) if (!probed) { char *errbuf = NULL; probed = true; - if ((errbuf = virKModLoad(driver, true))) { - VIR_WARN("failed to load driver %s: %s", driver, errbuf); + if ((errbuf = virKModLoad(drvname, true))) { + VIR_WARN("failed to load driver %s: %s", drvname, errbuf); VIR_FREE(errbuf); goto cleanup; } @@ -1021,15 +1036,15 @@ virPCIProbeStubDriver(const char *driver) /* If we know failure was because of blacklist, let's report that; * otherwise, report a more generic failure message */ - if (virKModIsBlacklisted(driver)) { + if (virKModIsBlacklisted(drvname)) { virReportError(VIR_ERR_INTERNAL_ERROR, _("Failed to load PCI stub module %s: " "administratively prohibited"), - driver); + drvname); } else { virReportError(VIR_ERR_INTERNAL_ERROR, _("Failed to load PCI stub module %s"), - driver); + drvname); } return -1; @@ -1072,13 +1087,6 @@ virPCIDeviceUnbind(virPCIDevicePtr dev) return ret; } -static const char *virPCIKnownStubs[] = { - "pciback", /* used by xen */ - "pci-stub", /* used by kvm legacy passthrough */ - "vfio-pci", /* used by VFIO device assignment */ - NULL -}; - static int virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) { @@ -1086,7 +1094,8 @@ virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) char *drvdir = NULL; char *path = NULL; char *driver = NULL; - const char **stubTest; + virPCIStubDriver stubDriver; + const char *stubDriverName; bool isStub = false; /* If the device is currently bound to one of the "well known" @@ -1104,10 +1113,11 @@ virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) goto remove_slot; /* If the device isn't bound to a known stub, skip the unbind. */ - for (stubTest = virPCIKnownStubs; *stubTest != NULL; stubTest++) { - if (STREQ(driver, *stubTest)) { + for (stubDriver = 0; stubDriver < VIR_PCI_STUB_DRIVER_LAST; stubDriver++) { + stubDriverName = virPCIStubDriverTypeToString(stubDriver); + if (stubDriverName && STREQ(driver, stubDriverName)) { isStub = true; - VIR_DEBUG("Found stub driver %s", *stubTest); + VIR_DEBUG("Found stub driver %s", stubDriverName); break; } } @@ -1182,9 +1192,16 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev) char *stubDriverPath = NULL; char *driverLink = NULL; char *path = NULL; /* reused for different purposes */ - char *stubDriverName = dev->stubDriver; + const char *stubDriverName = NULL; virErrorPtr err = NULL; + if (!(stubDriverName = virPCIStubDriverTypeToString(dev->stubDriver))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + "%s", + _("Attempting to use unknown stub driver")); + return -1; + } + if (!(stubDriverPath = virPCIDriverDir(stubDriverName)) || !(driverLink = virPCIFile(dev->name, "driver"))) goto cleanup; @@ -1337,8 +1354,6 @@ virPCIDeviceDetach(virPCIDevicePtr dev, virPCIDeviceList *activeDevs, virPCIDeviceList *inactiveDevs) { - sa_assert(dev->stubDriver); - if (virPCIProbeStubDriver(dev->stubDriver) < 0) return -1; @@ -1622,10 +1637,9 @@ virPCIDeviceCopy(virPCIDevicePtr dev) /* shallow copy to take care of most attributes */ *copy = *dev; - copy->path = copy->stubDriver = NULL; + copy->path = NULL; copy->used_by_drvname = copy->used_by_domname = NULL; if (VIR_STRDUP(copy->path, dev->path) < 0 || - VIR_STRDUP(copy->stubDriver, dev->stubDriver) < 0 || VIR_STRDUP(copy->used_by_drvname, dev->used_by_drvname) < 0 || VIR_STRDUP(copy->used_by_domname, dev->used_by_domname) < 0) { goto error; @@ -1645,7 +1659,6 @@ virPCIDeviceFree(virPCIDevicePtr dev) return; VIR_DEBUG("%s %s: freeing", dev->id, dev->name); VIR_FREE(dev->path); - VIR_FREE(dev->stubDriver); VIR_FREE(dev->used_by_drvname); VIR_FREE(dev->used_by_domname); VIR_FREE(dev); @@ -1694,14 +1707,13 @@ virPCIDeviceGetManaged(virPCIDevicePtr dev) return dev->managed; } -int -virPCIDeviceSetStubDriver(virPCIDevicePtr dev, const char *driver) +void +virPCIDeviceSetStubDriver(virPCIDevicePtr dev, virPCIStubDriver driver) { - VIR_FREE(dev->stubDriver); - return VIR_STRDUP(dev->stubDriver, driver); + dev->stubDriver = driver; } -const char * +virPCIStubDriver virPCIDeviceGetStubDriver(virPCIDevicePtr dev) { return dev->stubDriver; diff --git a/src/util/virpci.h b/src/util/virpci.h index d9911d0..c7bdb58 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -43,6 +43,15 @@ struct _virPCIDeviceAddress { }; typedef enum { + VIR_PCI_STUB_DRIVER_XEN = 0, + VIR_PCI_STUB_DRIVER_KVM, + VIR_PCI_STUB_DRIVER_VFIO, + VIR_PCI_STUB_DRIVER_LAST +} virPCIStubDriver; + +VIR_ENUM_DECL(virPCIStubDriver); + +typedef enum { VIR_PCIE_LINK_SPEED_NA = 0, VIR_PCIE_LINK_SPEED_25, VIR_PCIE_LINK_SPEED_5, @@ -90,10 +99,9 @@ int virPCIDeviceReset(virPCIDevicePtr dev, void virPCIDeviceSetManaged(virPCIDevice *dev, bool managed); unsigned int virPCIDeviceGetManaged(virPCIDevice *dev); -int virPCIDeviceSetStubDriver(virPCIDevicePtr dev, - const char *driver) - ATTRIBUTE_NONNULL(2); -const char *virPCIDeviceGetStubDriver(virPCIDevicePtr dev); +void virPCIDeviceSetStubDriver(virPCIDevicePtr dev, + virPCIStubDriver driver); +virPCIStubDriver virPCIDeviceGetStubDriver(virPCIDevicePtr dev); virPCIDeviceAddressPtr virPCIDeviceGetAddress(virPCIDevicePtr dev); int virPCIDeviceSetUsedBy(virPCIDevice *dev, const char *drv_name, diff --git a/src/xen/xen_driver.c b/src/xen/xen_driver.c index ce31f0f..7b2a3aa 100644 --- a/src/xen/xen_driver.c +++ b/src/xen/xen_driver.c @@ -2531,8 +2531,7 @@ xenUnifiedNodeDeviceDetachFlags(virNodeDevicePtr dev, return -1; if (!driverName) { - if (virPCIDeviceSetStubDriver(pci, "pciback") < 0) - goto out; + virPCIDeviceSetStubDriver(pci, VIR_PCI_STUB_DRIVER_XEN); } else { virReportError(VIR_ERR_INVALID_ARG, _("unknown driver name '%s'"), driverName); diff --git a/tests/virhostdevtest.c b/tests/virhostdevtest.c index faebdd4..f20992b 100644 --- a/tests/virhostdevtest.c +++ b/tests/virhostdevtest.c @@ -97,9 +97,10 @@ myInit(void) } for (i = 0; i < nhostdevs; i++) { - if (!(dev[i] = virPCIDeviceNew(0, 0, i + 1, 0)) || - virPCIDeviceSetStubDriver(dev[i], "pci-stub") < 0) + if (!(dev[i] = virPCIDeviceNew(0, 0, i + 1, 0))) goto cleanup; + + virPCIDeviceSetStubDriver(dev[i], VIR_PCI_STUB_DRIVER_KVM); } if (VIR_ALLOC(mgr) < 0) diff --git a/tests/virpcitest.c b/tests/virpcitest.c index 35b4781..b5052e5 100644 --- a/tests/virpcitest.c +++ b/tests/virpcitest.c @@ -107,10 +107,11 @@ testVirPCIDeviceDetach(const void *oaque ATTRIBUTE_UNUSED) CHECK_LIST_COUNT(inactiveDevs, 0); for (i = 0; i < nDev; i++) { - if (!(dev[i] = virPCIDeviceNew(0, 0, i + 1, 0)) || - virPCIDeviceSetStubDriver(dev[i], "pci-stub") < 0) + if (!(dev[i] = virPCIDeviceNew(0, 0, i + 1, 0))) goto cleanup; + virPCIDeviceSetStubDriver(dev[i], VIR_PCI_STUB_DRIVER_KVM); + if (virPCIDeviceDetach(dev[i], activeDevs, inactiveDevs) < 0) goto cleanup; @@ -147,10 +148,11 @@ testVirPCIDeviceReset(const void *opaque ATTRIBUTE_UNUSED) CHECK_LIST_COUNT(inactiveDevs, 0); for (i = 0; i < nDev; i++) { - if (!(dev[i] = virPCIDeviceNew(0, 0, i + 1, 0)) || - virPCIDeviceSetStubDriver(dev[i], "pci-stub") < 0) + if (!(dev[i] = virPCIDeviceNew(0, 0, i + 1, 0))) goto cleanup; + virPCIDeviceSetStubDriver(dev[i], VIR_PCI_STUB_DRIVER_KVM); + if (virPCIDeviceReset(dev[i], activeDevs, inactiveDevs) < 0) goto cleanup; } @@ -189,8 +191,7 @@ testVirPCIDeviceReattach(const void *opaque ATTRIBUTE_UNUSED) CHECK_LIST_COUNT(activeDevs, 0); CHECK_LIST_COUNT(inactiveDevs, i + 1); - if (virPCIDeviceSetStubDriver(dev[i], "pci-stub") < 0) - goto cleanup; + virPCIDeviceSetStubDriver(dev[i], VIR_PCI_STUB_DRIVER_KVM); } CHECK_LIST_COUNT(activeDevs, 0); @@ -248,8 +249,9 @@ testVirPCIDeviceDetachSingle(const void *opaque) if (!dev) goto cleanup; - if (virPCIDeviceSetStubDriver(dev, "pci-stub") < 0 || - virPCIDeviceDetach(dev, NULL, NULL) < 0) + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_KVM); + + if (virPCIDeviceDetach(dev, NULL, NULL) < 0) goto cleanup; ret = 0; @@ -262,15 +264,16 @@ static int testVirPCIDeviceDetachFail(const void *opaque) { const struct testPCIDevData *data = opaque; + const char *stubDriverName = NULL; int ret = -1; virPCIDevicePtr dev; + virPCIStubDriver stubDriver; dev = virPCIDeviceNew(data->domain, data->bus, data->slot, data->function); if (!dev) goto cleanup; - if (virPCIDeviceSetStubDriver(dev, "vfio-pci") < 0) - goto cleanup; + virPCIDeviceSetStubDriver(dev, VIR_PCI_STUB_DRIVER_VFIO); if (virPCIDeviceDetach(dev, NULL, NULL) < 0) { if (virTestGetVerbose() || virTestGetDebug()) @@ -278,10 +281,18 @@ testVirPCIDeviceDetachFail(const void *opaque) virResetLastError(); ret = 0; } else { + stubDriver = virPCIDeviceGetStubDriver(dev); + if (!(stubDriverName = virPCIStubDriverTypeToString(stubDriver))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + "%s", + "Attempting to use unknown stub driver"); + goto cleanup; + } + virReportError(VIR_ERR_INTERNAL_ERROR, "Attaching device %s to %s should have failed", virPCIDeviceGetName(dev), - virPCIDeviceGetStubDriver(dev)); + stubDriverName); } cleanup: -- 2.5.0

This is a straightforward wrapper around virPCIDeviceAddressIOMMUGroupIterate() that will make some code less awkward to write later on. --- src/libvirt_private.syms | 1 + src/util/virpci.c | 26 ++++++++++++++++++++++++++ src/util/virpci.h | 8 ++++++-- 3 files changed, 33 insertions(+), 2 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index c1fd9f6..f8aaa4c 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1967,6 +1967,7 @@ virPCIDeviceGetStubDriver; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; virPCIDeviceHasPCIExpressLink; +virPCIDeviceIOMMUGroupIterate; virPCIDeviceIsAssignable; virPCIDeviceIsPCIExpress; virPCIDeviceListAdd; diff --git a/src/util/virpci.c b/src/util/virpci.c index e82583a..d3b2c7e 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1786,6 +1786,32 @@ void virPCIDeviceReattachInit(virPCIDevicePtr pci) pci->reprobe = true; } +/** + * virPCIDeviceIOMMUGroupIterate: + * @dev: PCI device + * @actor: function to be called for all PCI addresses in @dev's IOMMU group + * @opaque: data passed as the last parameter to @actor + * + * Convenience wrapper around virPCIDeviceAddressIOMMUGroupIterate(). + * + * Behaves exactly the same, except it takes a virPCIDevicePtr instead of a + * virPCIDeviceAddressPtr as its first argument. + * + * Returns: 0 on success, <0 on failure. + */ +int +virPCIDeviceIOMMUGroupIterate(virPCIDevicePtr dev, + virPCIDeviceAddressActor actor, + void *opaque) +{ + virPCIDeviceAddress devAddr = { dev->domain, dev->bus, + dev->slot, dev->function }; + + return virPCIDeviceAddressIOMMUGroupIterate(&devAddr, + actor, + opaque); +} + virPCIDeviceListPtr virPCIDeviceListNew(void) diff --git a/src/util/virpci.h b/src/util/virpci.h index c7bdb58..87400e0 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -120,6 +120,12 @@ void virPCIDeviceSetReprobe(virPCIDevice *dev, bool reprobe); void virPCIDeviceReattachInit(virPCIDevice *dev); +typedef int (*virPCIDeviceAddressActor)(virPCIDeviceAddress *addr, + void *opaque); +int virPCIDeviceIOMMUGroupIterate(virPCIDevicePtr dev, + virPCIDeviceAddressActor actor, + void *opaque); + virPCIDeviceListPtr virPCIDeviceListNew(void); int virPCIDeviceListAdd(virPCIDeviceListPtr list, @@ -158,8 +164,6 @@ int virPCIDeviceFileIterate(virPCIDevicePtr dev, virPCIDeviceFileActor actor, void *opaque); -typedef int (*virPCIDeviceAddressActor)(virPCIDeviceAddress *addr, - void *opaque); int virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceAddressPtr orig, virPCIDeviceAddressActor actor, void *opaque); -- 2.5.0

This function, previously called virHostdevIsPCINodeDeviceUsed(), was written to work both when used on its own and when used with virPCIDeviceAddressIOMMUGroupIterate(); as a consequence, it was slightly more convoluted than it needed to be. Rework it to perform a single task (checking that a PCI device is not used by a domain) and remove the redundant "Node" from its name. Add some documentation as well. The safety checks that are no longer performed by this function will be reintroduced later on in a different form. --- src/util/virhostdev.c | 110 ++++++++++++++++---------------------------------- 1 file changed, 35 insertions(+), 75 deletions(-) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 33a45cb..77757c5 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -54,51 +54,43 @@ static virClassPtr virHostdevManagerClass; static void virHostdevManagerDispose(void *obj); static virHostdevManagerPtr virHostdevManagerNew(void); -struct virHostdevIsPCINodeDeviceUsedData { - virHostdevManagerPtr hostdev_mgr; - const char *domainName; - const bool usesVfio; -}; - -static int virHostdevIsPCINodeDeviceUsed(virPCIDeviceAddressPtr devAddr, void *opaque) +/** + * virHostdevIsPCIDeviceUsed: + * @dev: PCI device + * @hostdev_mgr: hostdev manager + * + * Checks whether @dev is already assigned to a domain, reporting a detailed + * error if that's the case. + * + * Returns: true if @dev is in use by some domain, false otherwise + */ +static bool +virHostdevIsPCIDeviceUsed(virPCIDevicePtr dev, + virHostdevManagerPtr hostdev_mgr) { virPCIDevicePtr other; - int ret = -1; - virPCIDevicePtr pci = NULL; - struct virHostdevIsPCINodeDeviceUsedData *helperData = opaque; + const char *other_drvname = NULL; + const char *other_domname = NULL; + bool ret = false; - if (!(pci = virPCIDeviceNew(devAddr->domain, devAddr->bus, - devAddr->slot, devAddr->function))) - goto cleanup; + if (!(other = virPCIDeviceListFind(hostdev_mgr->activePCIHostdevs, dev))) + goto out; - other = virPCIDeviceListFind(helperData->hostdev_mgr->activePCIHostdevs, - pci); - if (other) { - const char *other_drvname = NULL; - const char *other_domname = NULL; - virPCIDeviceGetUsedBy(other, &other_drvname, &other_domname); + virPCIDeviceGetUsedBy(other, &other_drvname, &other_domname); - if (helperData->usesVfio && - (other_domname && helperData->domainName) && - (STREQ(other_domname, helperData->domainName))) - goto iommu_owner; + if (other_drvname && other_domname) + virReportError(VIR_ERR_OPERATION_INVALID, + _("PCI device %s is in use by " + "driver %s, domain %s"), + virPCIDeviceGetName(dev), + other_drvname, other_domname); + else + virReportError(VIR_ERR_OPERATION_INVALID, + _("PCI device %s is in use"), + virPCIDeviceGetName(dev)); + ret = true; - if (other_drvname && other_domname) - virReportError(VIR_ERR_OPERATION_INVALID, - _("PCI device %s is in use by " - "driver %s, domain %s"), - virPCIDeviceGetName(pci), - other_drvname, other_domname); - else - virReportError(VIR_ERR_OPERATION_INVALID, - _("PCI device %s is in use"), - virPCIDeviceGetName(pci)); - goto cleanup; - } - iommu_owner: - ret = 0; - cleanup: - virPCIDeviceFree(pci); + out: return ret; } @@ -538,7 +530,6 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, int last_processed_hostdev_vf = -1; size_t i; int ret = -1; - virPCIDeviceAddressPtr devAddr = NULL; if (!nhostdevs) return 0; @@ -565,8 +556,6 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); bool strict_acs_check = !!(flags & VIR_HOSTDEV_STRICT_ACS_CHECK); bool usesVfio = (virPCIDeviceGetStubDriver(dev) == VIR_PCI_STUB_DRIVER_VFIO); - struct virHostdevIsPCINodeDeviceUsedData data = {hostdev_mgr, dom_name, - usesVfio}; if (!usesVfio && !virPCIDeviceIsAssignable(dev, strict_acs_check)) { virReportError(VIR_ERR_OPERATION_INVALID, @@ -575,23 +564,9 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, goto cleanup; } - VIR_FREE(devAddr); - if (!(devAddr = virPCIDeviceGetAddress(dev))) - goto cleanup; - - /* The device is in use by other active domain if - * the dev is in list activePCIHostdevs. VFIO devices - * belonging to same iommu group can't be shared - * across guests. - */ - if (usesVfio) { - if (virPCIDeviceAddressIOMMUGroupIterate(devAddr, - virHostdevIsPCINodeDeviceUsed, - &data) < 0) - goto cleanup; - } else if (virHostdevIsPCINodeDeviceUsed(devAddr, &data)) { + /* Host devices can't be shared between domains */ + if (virHostdevIsPCIDeviceUsed(dev, hostdev_mgr)) goto cleanup; - } } /* Loop 2: detach managed devices (i.e. bind to appropriate stub driver) */ @@ -718,7 +693,6 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, virObjectUnlock(hostdev_mgr->activePCIHostdevs); virObjectUnlock(hostdev_mgr->inactivePCIHostdevs); virObjectUnref(pcidevs); - VIR_FREE(devAddr); return ret; } @@ -1543,18 +1517,12 @@ int virHostdevPCINodeDeviceDetach(virHostdevManagerPtr hostdev_mgr, virPCIDevicePtr pci) { - virPCIDeviceAddressPtr devAddr = NULL; - struct virHostdevIsPCINodeDeviceUsedData data = { hostdev_mgr, NULL, - false }; int ret = -1; virObjectLock(hostdev_mgr->activePCIHostdevs); virObjectLock(hostdev_mgr->inactivePCIHostdevs); - if (!(devAddr = virPCIDeviceGetAddress(pci))) - goto out; - - if (virHostdevIsPCINodeDeviceUsed(devAddr, &data)) + if (virHostdevIsPCIDeviceUsed(pci, hostdev_mgr)) goto out; if (virPCIDeviceDetach(pci, hostdev_mgr->activePCIHostdevs, @@ -1566,7 +1534,6 @@ virHostdevPCINodeDeviceDetach(virHostdevManagerPtr hostdev_mgr, out: virObjectUnlock(hostdev_mgr->inactivePCIHostdevs); virObjectUnlock(hostdev_mgr->activePCIHostdevs); - VIR_FREE(devAddr); return ret; } @@ -1574,18 +1541,12 @@ int virHostdevPCINodeDeviceReAttach(virHostdevManagerPtr hostdev_mgr, virPCIDevicePtr pci) { - virPCIDeviceAddressPtr devAddr = NULL; - struct virHostdevIsPCINodeDeviceUsedData data = {hostdev_mgr, NULL, - false}; int ret = -1; virObjectLock(hostdev_mgr->activePCIHostdevs); virObjectLock(hostdev_mgr->inactivePCIHostdevs); - if (!(devAddr = virPCIDeviceGetAddress(pci))) - goto out; - - if (virHostdevIsPCINodeDeviceUsed(devAddr, &data)) + if (virHostdevIsPCIDeviceUsed(pci, hostdev_mgr)) goto out; virPCIDeviceReattachInit(pci); @@ -1598,7 +1559,6 @@ virHostdevPCINodeDeviceReAttach(virHostdevManagerPtr hostdev_mgr, out: virObjectUnlock(hostdev_mgr->inactivePCIHostdevs); virObjectUnlock(hostdev_mgr->activePCIHostdevs); - VIR_FREE(devAddr); return ret; } -- 2.5.0

The newly-introduced virHostdevIsPCIDeviceSafeForDetach() function checks against all problematic situations that can arise when detaching a PCI device from the host for VFIO device assignment. Some of the checks are simply reintroduced after being removed when reworking virHostdevIsPCIDeviceUsed() earlier, but some are new and allow us to detect an unsafe operation and abort immediately, instead of calling QEMU and waiting for it to error out, which is nicer from the user's point of view. --- src/util/virhostdev.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 81 insertions(+) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 77757c5..9174525 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -94,6 +94,67 @@ virHostdevIsPCIDeviceUsed(virPCIDevicePtr dev, return ret; } +typedef struct { + const char *domname; + virPCIDeviceListPtr activeDevices; + virPCIDeviceListPtr inactiveDevices; + virPCIDeviceListPtr pendingDevices; +} virHostdevIsPCIDeviceSafeData; + +/** + * virHostdevIsPCIDeviceSafe: + * @addr: PCI device address + * @opaque: user data (virHostdevIsPCIDeviceSafeData*) + * + * Checks whether a PCI device can be safely detached from the host for device + * assignment. + * + * It is meant to be called from virPCIDeviceAddressIOMMUGroupIterate() or + * virPCIDeviceIOMMUGroupIterate(), and the checks it performs only make + * sense when using VFIO device assignment. + * + * Returns: 0 if the device can be safely detached from the host, <0 otherwise + */ +static int +virHostdevIsPCIDeviceSafeForDetach(virPCIDeviceAddressPtr addr, + void *opaque) +{ + virHostdevIsPCIDeviceSafeData *data = opaque; + virPCIDevicePtr other = NULL; + + /* Pending devices are safe: they're the ones we're handling right now */ + if (data->pendingDevices && + virPCIDeviceListFindByIDs(data->pendingDevices, + addr->domain, addr->bus, + addr->slot, addr->function)) + return 0; + + /* Inactive devices are safe: they're already bound to a stub driver */ + if (data->inactiveDevices && + virPCIDeviceListFindByIDs(data->inactiveDevices, + addr->domain, addr->bus, + addr->slot, addr->function)) + return 0; + + /* Active devices are safe as long as they're used by the same domain */ + if (data->activeDevices && + (other = virPCIDeviceListFindByIDs(data->activeDevices, + addr->domain, addr->bus, + addr->slot, addr->function))) { + const char *other_domname = NULL; + const char *other_drvname = NULL; + + virPCIDeviceGetUsedBy(other, &other_drvname, &other_domname); + + if (STREQ_NULLABLE(data->domname, other_domname)) + return 0; + } + + /* If we got this far it means that the device is used by either another + * domain or the host, so it shouldn't be considered safe for assignment */ + return -1; +} + static int virHostdevManagerOnceInit(void) { if (!(virHostdevManagerClass = virClassNew(virClassForObject(), @@ -554,6 +615,10 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); + virHostdevIsPCIDeviceSafeData data = { dom_name, + hostdev_mgr->activePCIHostdevs, + hostdev_mgr->inactivePCIHostdevs, + pcidevs }; bool strict_acs_check = !!(flags & VIR_HOSTDEV_STRICT_ACS_CHECK); bool usesVfio = (virPCIDeviceGetStubDriver(dev) == VIR_PCI_STUB_DRIVER_VFIO); @@ -567,6 +632,22 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, /* Host devices can't be shared between domains */ if (virHostdevIsPCIDeviceUsed(dev, hostdev_mgr)) goto cleanup; + + /* When using VFIO, we have to take care not to end up in a situation + * where some of the devices in a IOMMU group are in use by the host + * and some other are assigned to a domain. KVM will actually prevent + * that from happening, but we want to bail earlier and with a better + * error message */ + if (usesVfio && + virPCIDeviceIOMMUGroupIterate(dev, + virHostdevIsPCIDeviceSafeForDetach, + &data) < 0) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("PCI device %s is not assignable: " + "attempting to share IOMMU ownership"), + virPCIDeviceGetName(dev)); + goto cleanup; + } } /* Loop 2: detach managed devices (i.e. bind to appropriate stub driver) */ -- 2.5.0

When reattaching VFIO devices to the host, we have to be careful not to end up in a situation where ownership of the IOMMU group is shared between the hosts and a domain, as doing so will result in host crashes. If a PCI device is not safe for reattach, as detected by virHostdevIsPCIDeviceSafeForReattach(), we mark as inactive and delay actually reattaching it to the host to a later time when doing so will not result in shared IOMMU group ownership. Note that, with this commit, VFIO devices will no longer be reattached to the host, even when it would be safe to do so. A later commit will restore the missing functionality. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1272300 --- src/util/virhostdev.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 9174525..0f2916b 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -155,6 +155,38 @@ virHostdevIsPCIDeviceSafeForDetach(virPCIDeviceAddressPtr addr, return -1; } +/** + * virHostdevIsPCIDeviceSafeForReattach: + * @addr: PCI device address + * @opaque: user data (virHostdevIsPCIDeviceSafeData*) + * + * Checks whether a PCI device can be safely reattached to the host. + * + * It is meant to be called from virPCIDeviceAddressIOMMUGroupIterate() or + * virPCIDeviceIOMMUGroupIterate(), and the checks it performs only make + * sense when using VFIO device assignment. + * + * Returns: 0 if the device can be safely reattached to the host, <0 otherwise + */ +static int +virHostdevIsPCIDeviceSafeForReattach(virPCIDeviceAddressPtr addr, + void *opaque) +{ + virHostdevIsPCIDeviceSafeData *data = opaque; + + /* Active devices are not safe: IOMMU groups can't be shared between + * the host and its domains */ + if (data->activeDevices && + virPCIDeviceListFindByIDs(data->activeDevices, + addr->domain, addr->bus, + addr->slot, addr->function)) + return -1; + + /* All other situations are either safe (pending device, inactive device) + * or can't happen (device in use by the host) */ + return 0; +} + static int virHostdevManagerOnceInit(void) { if (!(virHostdevManagerClass = virClassNew(virClassForObject(), @@ -900,6 +932,41 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr hostdev_mgr, } } + /* Loop 3: delay handling of devices that are not yet safe for reattach */ + i = 0; + while (i < virPCIDeviceListCount(pcidevs)) { + virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); + virHostdevIsPCIDeviceSafeData data = { dom_name, + hostdev_mgr->activePCIHostdevs, + hostdev_mgr->inactivePCIHostdevs, + pcidevs }; + bool usesVFIO = (virPCIDeviceGetStubDriver(dev) == VIR_PCI_STUB_DRIVER_VFIO); + + if (usesVFIO && + virPCIDeviceIOMMUGroupIterate(dev, + virHostdevIsPCIDeviceSafeForReattach, + &data) < 0) { + /* Don't process the device further. Later on, when the last + * device in this IOMMU group will be unassigned from its domain, + * we will process all the devices at once, including this one */ + VIR_DEBUG("Delaying reattaching of PCI device %s", + virPCIDeviceGetName(dev)); + dev = virPCIDeviceListSteal(pcidevs, dev); + } else { + /* This device is safe, keep it in the list and skip ahead */ + VIR_DEBUG("PCI device %s can be safely reattached", + virPCIDeviceGetName(dev)); + i++; + } + + /* Mark the device as inactive */ + virPCIDeviceListAddCopy(hostdev_mgr->inactivePCIHostdevs, dev); + } + + /* At this point, all devices we were asked to reattach have been removed + * from activePCIHostdevs and have been added to inactivePCIHostdevs; + * pcidevs contains the devices that we can safely process right now */ + /* Loop 3: perform a PCI Reset on all devices */ for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); -- 2.5.0

When we determine that a VFIO device can be safely reattached to the host, we have to also reattach all devices we might have delayed earlier in order to clean up properly. The new function virHostdevExpandPCIDeviceListIOMMUGroups() helps by adding inactive devices that are part of the same IOMMU group to the working set. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1272300 --- src/util/virhostdev.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 86 insertions(+) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 0f2916b..1042ab3 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -809,6 +809,84 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, return ret; } +/* virHostdevExpandPCIDeviceListIOMMUGroups: + * @list: list of PCI devices + * @otherList: list of PCI devices used to expand @list + * + * Expand IOMMU groups in @list by adding all devices in @otherList that + * are in the same IOMMU group as a device in @list. + * + * No device is removed from @otherList. + * + * Returns: 0 on success, <0 on failure + */ +static int +virHostdevExpandPCIDeviceListIOMMUGroups(virPCIDeviceListPtr list, + virPCIDeviceListPtr otherList) +{ + virPCIDeviceListPtr addList = NULL; + virPCIDeviceListPtr tmpList = NULL; + size_t i; + int ret = -1; + + /* addList will temporarily store all devices we're going to add */ + if (!(addList = virPCIDeviceListNew())) + goto cleanup; + + for (i = 0; i < virPCIDeviceListCount(list); i++) { + virPCIDevicePtr dev = virPCIDeviceListGet(list, i); + size_t j; + + /* Skip non-VFIO devices */ + if (virPCIDeviceGetStubDriver(dev) != VIR_PCI_STUB_DRIVER_VFIO) + continue; + + /* Get the list of devices in the same IOMMU group */ + virObjectUnref(tmpList); + if (!(tmpList = virPCIDeviceGetIOMMUGroupList(dev))) + goto cleanup; + + for (j = 0; j < virPCIDeviceListCount(tmpList); j++) { + virPCIDevicePtr tmpDev = virPCIDeviceListGet(tmpList, j); + virPCIDevicePtr otherDev; + + /* Devices that are present in the original list or we have already + * decided we want to add must be skipped to avoid duplicates */ + if (virPCIDeviceListFind(list, tmpDev)) + continue; + if (virPCIDeviceListFind(addList, tmpDev)) + continue; + + /* Devices that do not appear in otherList should not be included. + * We retrieve the device stored in otherList instead of using the + * one in tmpList because it might contain extra information, + * eg. what stub driver should be used when performing device + * assignment */ + if (!(otherDev = virPCIDeviceListFind(otherList, tmpDev))) + continue; + + /* Make a copy of the device so we can add it later */ + if (virPCIDeviceListAddCopy(addList, otherDev) < 0) + goto cleanup; + } + } + + /* Move all devices we've copied earlier to the list */ + while (virPCIDeviceListCount(addList) > 0) { + virPCIDevicePtr addDev = virPCIDeviceListStealIndex(addList, 0); + if (virPCIDeviceListAdd(list, addDev) < 0) + goto cleanup; + } + + ret = 0; + + cleanup: + virObjectUnref(tmpList); + virObjectUnref(addList); + + return ret; +} + /* * Pre-condition: inactivePCIHostdevs & activePCIHostdevs * are locked @@ -967,6 +1045,14 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr hostdev_mgr, * from activePCIHostdevs and have been added to inactivePCIHostdevs; * pcidevs contains the devices that we can safely process right now */ + /* We need to reset and reattach not only them, but also all inactive + * devices that are in the same IOMMU group if VFIO is in use. + * virHostdevExpandPCIDeviceListIOMMUGroups() takes care of updating + * the device list appropriately */ + if (virHostdevExpandPCIDeviceListIOMMUGroups(pcidevs, + hostdev_mgr->inactivePCIHostdevs) < 0) + goto cleanup; + /* Loop 3: perform a PCI Reset on all devices */ for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); -- 2.5.0

We mark all VFIO devices as inactive after we've detached them from the domain, so that step doesn't need to be performed afterwards. --- src/util/virhostdev.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 1042ab3..b350de0 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -894,12 +894,10 @@ virHostdevExpandPCIDeviceListIOMMUGroups(virPCIDeviceListPtr list, static void virHostdevReattachPCIDevice(virPCIDevicePtr dev, virHostdevManagerPtr mgr) { - /* If the device is not managed and was attached to guest - * successfully, it must have been inactive. - */ + /* We don't need to do anything for unmanaged devices */ if (!virPCIDeviceGetManaged(dev)) { - if (virPCIDeviceListAdd(mgr->inactivePCIHostdevs, dev) < 0) - virPCIDeviceFree(dev); + VIR_DEBUG("Skipping reattach of unmanaged PCI device %s", + virPCIDeviceGetName(dev)); return; } -- 2.5.0

Add a couple of useful debug messages, update comments to reflect recent changes, use the proper case for VFIO in variable names and fix parameter alignment in a function definition. --- src/util/virhostdev.c | 27 ++++++++++++++++----------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index b350de0..31cd152 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -652,9 +652,9 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, hostdev_mgr->inactivePCIHostdevs, pcidevs }; bool strict_acs_check = !!(flags & VIR_HOSTDEV_STRICT_ACS_CHECK); - bool usesVfio = (virPCIDeviceGetStubDriver(dev) == VIR_PCI_STUB_DRIVER_VFIO); + bool usesVFIO = (virPCIDeviceGetStubDriver(dev) == VIR_PCI_STUB_DRIVER_VFIO); - if (!usesVfio && !virPCIDeviceIsAssignable(dev, strict_acs_check)) { + if (!usesVFIO && !virPCIDeviceIsAssignable(dev, strict_acs_check)) { virReportError(VIR_ERR_OPERATION_INVALID, _("PCI device %s is not assignable"), virPCIDeviceGetName(dev)); @@ -670,7 +670,7 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr hostdev_mgr, * and some other are assigned to a domain. KVM will actually prevent * that from happening, but we want to bail earlier and with a better * error message */ - if (usesVfio && + if (usesVFIO && virPCIDeviceIOMMUGroupIterate(dev, virHostdevIsPCIDeviceSafeForDetach, &data) < 0) { @@ -892,7 +892,8 @@ virHostdevExpandPCIDeviceListIOMMUGroups(virPCIDeviceListPtr list, * are locked */ static void -virHostdevReattachPCIDevice(virPCIDevicePtr dev, virHostdevManagerPtr mgr) +virHostdevReattachPCIDevice(virPCIDevicePtr dev, + virHostdevManagerPtr mgr) { /* We don't need to do anything for unmanaged devices */ if (!virPCIDeviceGetManaged(dev)) { @@ -911,6 +912,7 @@ virHostdevReattachPCIDevice(virPCIDevicePtr dev, virHostdevManagerPtr mgr) } } + VIR_DEBUG("Reattaching PCI device %s", virPCIDeviceGetName(dev)); if (virPCIDeviceReattach(dev, mgr->activePCIHostdevs, mgr->inactivePCIHostdevs) < 0) { virErrorPtr err = virGetLastError(); @@ -951,11 +953,13 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr hostdev_mgr, goto cleanup; } - /* Loop through the assigned devices 4 times: 1) delete them all from - * activePCIHostdevs, 2) restore network config of SRIOV netdevs, 3) Do a - * PCI reset on each device, 4) reattach the devices to their host drivers - * (managed) or add them to inactivePCIHostdevs (!managed). - */ + /* Loop through the assigned devices 5 times: 1) delete them all from + * activePCIHostdevs, 2) restore network config of SRIOV netdevs, 3) Add + * all devices to inactivePCIHostdevs, and stop processing those devices + * that can't be safely reattached to the host right now, 4) Do a PCI reset + * on each device, 5) Reattach managed devices to their host drivers. + * Between 3) and 4), we make sure devices we've previously delayed are + * handled properly */ /* * Loop 1: verify that each device in the hostdevs list really was in use @@ -1051,10 +1055,11 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr hostdev_mgr, hostdev_mgr->inactivePCIHostdevs) < 0) goto cleanup; - /* Loop 3: perform a PCI Reset on all devices */ + /* Loop 4: perform a PCI Reset on all devices */ for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); + VIR_DEBUG("Resetting PCI device %s", virPCIDeviceGetName(dev)); if (virPCIDeviceReset(dev, hostdev_mgr->activePCIHostdevs, hostdev_mgr->inactivePCIHostdevs) < 0) { virErrorPtr err = virGetLastError(); @@ -1064,7 +1069,7 @@ virHostdevReAttachPCIDevices(virHostdevManagerPtr hostdev_mgr, } } - /* Loop 4: reattach devices to their host drivers (if managed) or place + /* Loop 5: reattach devices to their host drivers (if managed) or place * them on the inactive list (if not managed) */ while (virPCIDeviceListCount(pcidevs) > 0) { -- 2.5.0

On 12/02/2015 12:39 PM, Andrea Bolognani wrote:
This is a straightforward wrapper around virPCIDeviceAddressIOMMUGroupIterate() that will make some code less awkward to write later on. --- src/libvirt_private.syms | 1 + src/util/virpci.c | 26 ++++++++++++++++++++++++++ src/util/virpci.h | 8 ++++++-- 3 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index c1fd9f6..f8aaa4c 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1967,6 +1967,7 @@ virPCIDeviceGetStubDriver; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; virPCIDeviceHasPCIExpressLink; +virPCIDeviceIOMMUGroupIterate; virPCIDeviceIsAssignable; virPCIDeviceIsPCIExpress; virPCIDeviceListAdd; diff --git a/src/util/virpci.c b/src/util/virpci.c index e82583a..d3b2c7e 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1786,6 +1786,32 @@ void virPCIDeviceReattachInit(virPCIDevicePtr pci) pci->reprobe = true; }
+/** + * virPCIDeviceIOMMUGroupIterate: + * @dev: PCI device + * @actor: function to be called for all PCI addresses in @dev's IOMMU group + * @opaque: data passed as the last parameter to @actor + * + * Convenience wrapper around virPCIDeviceAddressIOMMUGroupIterate(). + * + * Behaves exactly the same, except it takes a virPCIDevicePtr instead of a + * virPCIDeviceAddressPtr as its first argument. + * + * Returns: 0 on success, <0 on failure. + */ +int +virPCIDeviceIOMMUGroupIterate(virPCIDevicePtr dev, + virPCIDeviceAddressActor actor, + void *opaque) +{ + virPCIDeviceAddress devAddr = { dev->domain, dev->bus, + dev->slot, dev->function }; + + return virPCIDeviceAddressIOMMUGroupIterate(&devAddr, + actor, + opaque); +}
Instead of creating this new function, how about if we change virPCIDevice to contain a virPCIDeviceAddress rather than individual domain, bus, slot, and function?

On Thu, 2015-12-03 at 14:06 -0500, Laine Stump wrote:
On 12/02/2015 12:39 PM, Andrea Bolognani wrote:
This is a straightforward wrapper around virPCIDeviceAddressIOMMUGroupIterate() that will make some code less awkward to write later on. --- src/libvirt_private.syms | 1 + src/util/virpci.c | 26 ++++++++++++++++++++++++++ src/util/virpci.h | 8 ++++++-- 3 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index c1fd9f6..f8aaa4c 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1967,6 +1967,7 @@ virPCIDeviceGetStubDriver; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; virPCIDeviceHasPCIExpressLink; +virPCIDeviceIOMMUGroupIterate; virPCIDeviceIsAssignable; virPCIDeviceIsPCIExpress; virPCIDeviceListAdd; diff --git a/src/util/virpci.c b/src/util/virpci.c index e82583a..d3b2c7e 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1786,6 +1786,32 @@ void virPCIDeviceReattachInit(virPCIDevicePtr pci) pci->reprobe = true; } +/** + * virPCIDeviceIOMMUGroupIterate: + * @dev: PCI device + * @actor: function to be called for all PCI addresses in @dev's IOMMU group + * @opaque: data passed as the last parameter to @actor + * + * Convenience wrapper around virPCIDeviceAddressIOMMUGroupIterate(). + * + * Behaves exactly the same, except it takes a virPCIDevicePtr instead of a + * virPCIDeviceAddressPtr as its first argument. + * + * Returns: 0 on success, <0 on failure. + */ +int +virPCIDeviceIOMMUGroupIterate(virPCIDevicePtr dev, + virPCIDeviceAddressActor actor, + void *opaque) +{ + virPCIDeviceAddress devAddr = { dev->domain, dev->bus, + dev->slot, dev->function }; + + return virPCIDeviceAddressIOMMUGroupIterate(&devAddr, + actor, + opaque); +}
Instead of creating this new function, how about if we change virPCIDevice to contain a virPCIDeviceAddress rather than individual domain, bus, slot, and function?
I think you already mentioned doing so in the past, and it's definitely a good idea. I'll get to it. After the change, we'll be able to just use virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceGetAddress(dev), actor, data); with virPCIDeviceGetAddress() returning a pointer to the address stored inside the device instead of a newly-allocated object. Cheers. -- Andrea Bolognani Software Engineer - Virtualization Team

On Fri, 2015-12-04 at 09:46 +0100, Andrea Bolognani wrote:
On Thu, 2015-12-03 at 14:06 -0500, Laine Stump wrote:
Instead of creating this new function, how about if we change virPCIDevice to contain a virPCIDeviceAddress rather than individual domain, bus, slot, and function?
I think you already mentioned doing so in the past, and it's definitely a good idea.
I'll get to it. After the change, we'll be able to just use
virPCIDeviceAddressIOMMUGroupIterate(virPCIDeviceGetAddress(dev), actor, data);
with virPCIDeviceGetAddress() returning a pointer to the address stored inside the device instead of a newly-allocated object.
Hi Laine, while I explore this, is there any chance you can take a look at the rest of the series and confirm that at least the base idea is sound? I've tried to break it down into easily digestible chunks and include plenty of comments so that hopefully getting a bird view is possible without having to do a full review. I'd very much value this kind of early feedback, even with the series still missing a couple of bits. At least the first three patches are really standalone cleanups, so maybe those could go in even without the rest? That would reduce the series size a tiny bit. Cheers. -- Andrea Bolognani Software Engineer - Virtualization Team

On Wed, 2015-12-02 at 18:17 +0100, Andrea Bolognani wrote:
This series is my attempt at fixing
[...]
The problem being solved is that, when using VFIO, IOMMU group ownership can't be shared, eg. two devices that are in the same IOMMU group can't be assigned to different guests, or to the host and a guest. If that happens, the host will probably crash.
The series deals with this issue by making sure safety conditions are met before detaching devices from the host or reattaching them to the host. In praticular, when we're asked to reattach a device to the host but doing so would lead to sharing IOMMU group ownership, we delay the operation until we can guarantee this will not cause problems. As a nice side effect of the changes we check for this when starting a guest too, instead of assuming it will work and having QEMU error out immediately afterwards.
Shivaprasad raised a concern on IRC, which I'm sharing here for wider discussion. I'm CC'ing Laine and Alex, hopefully they don't mind - let me know otherwise. Assume we have a PCI device with two functions. With this series applied, when reattaching both functions to the host this would happen: f0 remove from guest f1 remove from guest f0 unbind from vfio-pci f0 trigger host driver reprobe f1 unbind from vfio-pci f1 trigger host driver reprobe Shivaprasad is concerned this is not actually safe, and the proper sequence would rather be: f0 remove from guest f1 remove from guest f0 unbind from vfio-pci f1 unbind from vfio-pci f0 trigger host driver reprobe f1 trigger host driver reprobe Doing so would AFAICT mean basically duplicating the delay logic this series adds to virHostdev into virPCI, to ensure that devices are unbound from vfio-pci only once the same operation has been requested for all devices in the IOMMU group, and reprobe is triggered only after all devices have been unbound from vfio-pci. I was under the impression that what the current series does, eg. sharing devices in the same IOMMU group between the host driver and vfio-pci is safe as long as no guest is using them at the same time, and that devices could be safely "moved" between the host driver (eg. in use) and vfio-pci (eg. idle, waiting to be assigned to a guest) as many times as desired without ill consequences. Is my understanding wrong? Do I need to rework the series so that unbinds and reprobes are always executed across the IOMMU group? Any suggestion or pointers to relevant documentation will be very much appreciated. Cheers. -- Andrea Bolognani Software Engineer - Virtualization Team

On Wed, 2015-12-09 at 13:51 +0100, Andrea Bolognani wrote:
On Wed, 2015-12-02 at 18:17 +0100, Andrea Bolognani wrote:
This series is my attempt at fixing
[...]
The problem being solved is that, when using VFIO, IOMMU group ownership can't be shared, eg. two devices that are in the same IOMMU group can't be assigned to different guests, or to the host and a guest. If that happens, the host will probably crash.
The series deals with this issue by making sure safety conditions are met before detaching devices from the host or reattaching them to the host. In praticular, when we're asked to reattach a device to the host but doing so would lead to sharing IOMMU group ownership, we delay the operation until we can guarantee this will not cause problems. As a nice side effect of the changes we check for this when starting a guest too, instead of assuming it will work and having QEMU error out immediately afterwards.
Shivaprasad raised a concern on IRC, which I'm sharing here for wider discussion. I'm CC'ing Laine and Alex, hopefully they don't mind - let me know otherwise.
Assume we have a PCI device with two functions. With this series applied, when reattaching both functions to the host this would happen:
f0 remove from guest f1 remove from guest f0 unbind from vfio-pci f0 trigger host driver reprobe f1 unbind from vfio-pci f1 trigger host driver reprobe
Shivaprasad is concerned this is not actually safe, and the proper sequence would rather be:
f0 remove from guest f1 remove from guest f0 unbind from vfio-pci f1 unbind from vfio-pci f0 trigger host driver reprobe f1 trigger host driver reprobe
Doing so would AFAICT mean basically duplicating the delay logic this series adds to virHostdev into virPCI, to ensure that devices are unbound from vfio-pci only once the same operation has been requested for all devices in the IOMMU group, and reprobe is triggered only after all devices have been unbound from vfio-pci.
I was under the impression that what the current series does, eg. sharing devices in the same IOMMU group between the host driver and vfio-pci is safe as long as no guest is using them at the same time, and that devices could be safely "moved" between the host driver (eg. in use) and vfio-pci (eg. idle, waiting to be assigned to a guest) as many times as desired without ill consequences.
Is my understanding wrong? Do I need to rework the series so that unbinds and reprobes are always executed across the IOMMU group?
Any suggestion or pointers to relevant documentation will be very much appreciated.
Hi Andrea, Your understanding is correct, so I think it just comes down to how sure you are that the vfio group is idle/unused. If there's any risk that a device is still in use by QEMU, then we haven't solved the original problem. Unbinding isn't the only way to have good confidence of this though. You could track the QEMU pid, you could use fuser to make sure the group is not in use, you could try to open the group yourself to make sure it's not in use, and of course you can unbind as proposed in the second option. So long as you know the group is idle, somehow, it shouldn't matter what order you unbind and reprobe. Thanks, Alex

On Wed, 2015-12-09 at 08:09 -0700, Alex Williamson wrote:
I was under the impression that what the current series does, eg. sharing devices in the same IOMMU group between the host driver and vfio-pci is safe as long as no guest is using them at the same time, and that devices could be safely "moved" between the host driver (eg. in use) and vfio-pci (eg. idle, waiting to be assigned to a guest) as many times as desired without ill consequences.
Is my understanding wrong? Do I need to rework the series so that unbinds and reprobes are always executed across the IOMMU group?
Hi Andrea,
Your understanding is correct, so I think it just comes down to how sure you are that the vfio group is idle/unused. If there's any risk that a device is still in use by QEMU, then we haven't solved the original problem. Unbinding isn't the only way to have good confidence of this though. You could track the QEMU pid, you could use fuser to make sure the group is not in use, you could try to open the group yourself to make sure it's not in use, and of course you can unbind as proposed in the second option. So long as you know the group is idle, somehow, it shouldn't matter what order you unbind and reprobe.
Thanks for the confirmation, Alex. What we're doing right now in libvirt (and my series extends on that) is keeping track of each device's state internally by recording information such as which guest, if any, is using it. We're mostly assuming that the user will not mess with the configuration behind our back. I guess adding more checks to make sure that's actually the case would be good, but then again, we can hardly stop the user from writing stuff into /sys :) At least now we know my series is not built upon a completely incorrect understanding of the constraints! Quick follow-up question: all of this care with IOMMU group ownership is only really applied when dealing with VFIO passthrough - are legacy KVM assignment and Xen assignment really not affected at all? Is the IOMMU not involved in the process? Cheers. -- Andrea Bolognani Software Engineer - Virtualization Team

On Thu, 2015-12-10 at 09:06 +0100, Andrea Bolognani wrote:
On Wed, 2015-12-09 at 08:09 -0700, Alex Williamson wrote:
I was under the impression that what the current series does, eg. sharing devices in the same IOMMU group between the host driver and vfio-pci is safe as long as no guest is using them at the same time, and that devices could be safely "moved" between the host driver (eg. in use) and vfio-pci (eg. idle, waiting to be assigned to a guest) as many times as desired without ill consequences.
Is my understanding wrong? Do I need to rework the series so that unbinds and reprobes are always executed across the IOMMU group?
Hi Andrea,
Your understanding is correct, so I think it just comes down to how sure you are that the vfio group is idle/unused. If there's any risk that a device is still in use by QEMU, then we haven't solved the original problem. Unbinding isn't the only way to have good confidence of this though. You could track the QEMU pid, you could use fuser to make sure the group is not in use, you could try to open the group yourself to make sure it's not in use, and of course you can unbind as proposed in the second option. So long as you know the group is idle, somehow, it shouldn't matter what order you unbind and reprobe.
Thanks for the confirmation, Alex.
What we're doing right now in libvirt (and my series extends on that) is keeping track of each device's state internally by recording information such as which guest, if any, is using it.
We're mostly assuming that the user will not mess with the configuration behind our back. I guess adding more checks to make sure that's actually the case would be good, but then again, we can hardly stop the user from writing stuff into /sys :)
At least now we know my series is not built upon a completely incorrect understanding of the constraints!
Quick follow-up question: all of this care with IOMMU group ownership is only really applied when dealing with VFIO passthrough - are legacy KVM assignment and Xen assignment really not affected at all? Is the IOMMU not involved in the process?
Both legacy KVM device assignment and (afaik) Xen device assignment ignore the issues of device isolation, and typically even iommu granularity, and allow any configuration imaginable, even when it interferes with the operation of devices owned by other guests or even host-owned devices. Thanks, Alex
participants (3)
-
Alex Williamson
-
Andrea Bolognani
-
Laine Stump