[libvirt] [PATCH 00/22] support VFIO groups

When I first put in support for VFIO device assignment, I didn't realize that groups of devices were quite as common as they actually are. In particular, I didn't know that often multiple seemingly-unrelated devices can end up in the same VFIO iommu group due to unlucky circumstances of hardware - they may share a dma controller which means that the devices can't truly be isolated from each other, and thus should not be simultaneously assigned to different guests (or even used by the host) - all of the devices in a group should be either assigned to the same guest or, if not assigned to the guest, should be isolated off in a driver to prevent them from being used by the host. The following set of patches makes setting that up easier to deal with. The end result of all the patches is the following: 1) The virNodeDevice API will be able to detach or re-attach all the devices in a particular group with a single API call. 2) <hostdev managed='yes'>, <interface type='hostdev' managed='yes'>, and <interface type='network' managed='yes'> devices (where the network is itself a pool of SRIOV Virtual Functions) can specify: <driver name='vfio' group='auto'/> and libvirt will automatically detach (and bind to the 'vfio-pci' driver for assignment/isolation) all devices in the same group as the device being assigned. Likewise, when the device it detached from the guest, a check will be made and, if none of the devices in the same group as the device being detach is still in use by a guest As usual there are a lot of patches here, but many of them are extremely simple, so don't be put off by the count. Also, please note that patches 15/22 - 18/22 provide a new public API (virNodeDeviceReAttachFlags()). I kept the new API later in the series just in case someone wanted to backport as much as possible of this patchset to an old release. Laine Stump (22): syntax: virPCIDeviceFree is also a NOP for NULL args pci: change stubDriver from const char* to char* pci: new utility functions pci: eliminate memory leak in virPCIDeviceReattach pci: make virPCIDeviceDetach consistent in behavior pci: eliminate repetitive path constructions in virPCIDeviceBindToStub pci: eliminate unused driver arg from virPCIDeviceDetach pci: update stubDriver name in virPCIDeviceBindToStub pci: rename virPCIDeviceGetVFIOGroupDev to virPCIDeviceGetIOMMUGroupDev pci: make virPCIParseDeviceAddress public pci: new iommu_group functions pci: optionally detach/reattach all devices in a VFIO group API & qemu: add ability to detach an entire VFIO group of devices virsh: add option to detach entire group of devices API: new virNodeDeviceReAttachFlags API: implement RPC calls for virNodeDeviceReAttachFlags qemu: implement virNodeDeviceReAttachFlags xen: implement virNodeDeviceReAttachFlags virsh: add option to attach entire group of devices nodedev: add iommuGroup to node device object conf: add <driver group='auto'> to hostdev, interface, and networks qemu: implement backend of <driver group='auto'/> cfg.mk | 1 + docs/formatdomain.html.in | 94 ++- docs/formatnetwork.html.in | 11 + docs/formatnode.html.in | 63 +- docs/schemas/domaincommon.rng | 16 + docs/schemas/network.rng | 8 + docs/schemas/nodedev.rng | 11 + include/libvirt/libvirt.h.in | 18 + src/conf/domain_conf.c | 36 +- src/conf/domain_conf.h | 13 + src/conf/network_conf.c | 39 +- src/conf/network_conf.h | 14 + src/conf/node_device_conf.c | 86 ++- src/conf/node_device_conf.h | 5 +- src/driver.h | 5 + src/libvirt.c | 60 ++ src/libvirt_private.syms | 11 +- src/libvirt_public.syms | 4 + src/network/bridge_driver.c | 22 + src/node_device/node_device_udev.c | 21 +- src/qemu/qemu_cgroup.c | 4 +- src/qemu/qemu_driver.c | 40 +- src/qemu/qemu_hostdev.c | 39 +- src/remote/remote_driver.c | 29 + src/remote/remote_protocol.x | 12 +- src/remote_protocol-structs | 5 + src/security/security_apparmor.c | 2 +- src/security/security_dac.c | 4 +- src/security/security_selinux.c | 4 +- src/util/virpci.c | 707 ++++++++++++++++++--- src/util/virpci.h | 37 +- src/xen/xen_driver.c | 22 +- tests/networkxml2xmlin/hostdev-pf.xml | 2 +- tests/networkxml2xmlout/hostdev-pf.xml | 2 +- tests/nodedevschemadata/pci_8086_10c9_sriov_pf.xml | 16 + tests/nodedevxml2xmltest.c | 1 + .../qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml | 2 +- .../qemuxml2argv-net-hostdev-vfio.xml | 2 +- tools/virsh-nodedev.c | 30 +- 39 files changed, 1340 insertions(+), 158 deletions(-) create mode 100644 tests/nodedevschemadata/pci_8086_10c9_sriov_pf.xml -- 1.7.11.7

add it to the syntax-check list and fix the one offending caller. --- cfg.mk | 1 + src/util/virpci.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/cfg.mk b/cfg.mk index c093bf2..bbe84b3 100644 --- a/cfg.mk +++ b/cfg.mk @@ -165,6 +165,7 @@ useless_free_options = \ --name=virNodeDeviceObjFree \ --name=virObjectUnref \ --name=virObjectFreeCallback \ + --name=virPCIDeviceFree \ --name=virSecretDefFree \ --name=virStorageEncryptionFree \ --name=virStorageEncryptionSecretFree \ diff --git a/src/util/virpci.c b/src/util/virpci.c index 1346ec1..89c1eea 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1671,8 +1671,7 @@ virPCIDeviceListDel(virPCIDeviceListPtr list, virPCIDevicePtr dev) { virPCIDevicePtr ret = virPCIDeviceListSteal(list, dev); - if (ret) - virPCIDeviceFree(ret); + virPCIDeviceFree(ret); } int -- 1.7.11.7

On Mon, Jun 24, 2013 at 05:54:50 -0400, Laine Stump wrote:
add it to the syntax-check list and fix the one offending caller. --- cfg.mk | 1 + src/util/virpci.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-)
ACK Jirka

On 06/24/2013 06:18 AM, Jiri Denemark wrote:
On Mon, Jun 24, 2013 at 05:54:50 -0400, Laine Stump wrote:
add it to the syntax-check list and fix the one offending caller. --- cfg.mk | 1 + src/util/virpci.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-) ACK
Pushed. Thanks!

Previously stubDriver was always set from a string literal, so it was okay to use a const char * that wasn't freed when the virPCIDevice was freed. This will not be the case in the near future, so it is now a char* that is allocated in virPCIDeviceSetStubDriver() and freed during virPCIDeviceFree(). --- src/qemu/qemu_driver.c | 6 ++++-- src/qemu/qemu_hostdev.c | 25 +++++++++++++++++-------- src/util/virpci.c | 8 +++++--- src/util/virpci.h | 4 ++-- src/xen/xen_driver.c | 3 ++- 5 files changed, 30 insertions(+), 16 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 3d9457f..6efec74 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -10286,9 +10286,11 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, return -1; if (!driverName || STREQ(driverName, "kvm")) { - virPCIDeviceSetStubDriver(pci, "pci-stub"); + if (virPCIDeviceSetStubDriver(pci, "pci-stub") < 0) + goto out; } else if (STREQ(driverName, "vfio")) { - virPCIDeviceSetStubDriver(pci, "vfio-pci"); + if (virPCIDeviceSetStubDriver(pci, "vfio-pci") < 0) + goto out; } else { virReportError(VIR_ERR_INVALID_ARG, _("unknown driver name '%s'"), driverName); diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index 9013f60..7a9e6eb 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -70,9 +70,15 @@ qemuGetPciHostDeviceList(virDomainHostdevDefPtr *hostdevs, int nhostdevs) virPCIDeviceSetManaged(dev, hostdev->managed); if (hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - virPCIDeviceSetStubDriver(dev, "vfio-pci"); + if (virPCIDeviceSetStubDriver(dev, "vfio-pci") < 0) { + virObjectUnref(list); + return NULL; + } } else { - virPCIDeviceSetStubDriver(dev, "pci-stub"); + if (virPCIDeviceSetStubDriver(dev, "pci-stub") < 0) { + virObjectUnref(list); + return NULL; + } } } @@ -130,6 +136,7 @@ int qemuUpdateActivePciHostdevs(virQEMUDriverPtr driver, virDomainDefPtr def) { virDomainHostdevDefPtr hostdev = NULL; + virPCIDevicePtr dev = NULL; int i; int ret = -1; @@ -140,7 +147,6 @@ int qemuUpdateActivePciHostdevs(virQEMUDriverPtr driver, virObjectLock(driver->inactivePciHostdevs); for (i = 0; i < def->nhostdevs; i++) { - virPCIDevicePtr dev = NULL; hostdev = def->hostdevs[i]; if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS) @@ -159,9 +165,12 @@ int qemuUpdateActivePciHostdevs(virQEMUDriverPtr driver, virPCIDeviceSetManaged(dev, hostdev->managed); if (hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - virPCIDeviceSetStubDriver(dev, "vfio-pci"); + if (virPCIDeviceSetStubDriver(dev, "vfio-pci") < 0) + goto cleanup; } else { - virPCIDeviceSetStubDriver(dev, "pci-stub"); + if (virPCIDeviceSetStubDriver(dev, "pci-stub") < 0) + goto cleanup; + } virPCIDeviceSetUsedBy(dev, def->name); @@ -170,14 +179,14 @@ int qemuUpdateActivePciHostdevs(virQEMUDriverPtr driver, virPCIDeviceSetRemoveSlot(dev, hostdev->origstates.states.pci.remove_slot); virPCIDeviceSetReprobe(dev, hostdev->origstates.states.pci.reprobe); - if (virPCIDeviceListAdd(driver->activePciHostdevs, dev) < 0) { - virPCIDeviceFree(dev); + if (virPCIDeviceListAdd(driver->activePciHostdevs, dev) < 0) goto cleanup; - } + dev = NULL; } ret = 0; cleanup: + virPCIDeviceFree(dev); virObjectUnlock(driver->activePciHostdevs); virObjectUnlock(driver->inactivePciHostdevs); return ret; diff --git a/src/util/virpci.c b/src/util/virpci.c index 89c1eea..d00c3ee 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -69,7 +69,7 @@ struct _virPCIDevice { bool has_flr; bool has_pm_reset; bool managed; - const char *stubDriver; + char *stubDriver; /* used by reattach function */ bool unbind_from_stub; @@ -1480,6 +1480,7 @@ virPCIDeviceFree(virPCIDevicePtr dev) return; VIR_DEBUG("%s %s: freeing", dev->id, dev->name); VIR_FREE(dev->path); + VIR_FREE(dev->stubDriver); VIR_FREE(dev); } @@ -1500,10 +1501,11 @@ virPCIDeviceGetManaged(virPCIDevicePtr dev) return dev->managed; } -void +int virPCIDeviceSetStubDriver(virPCIDevicePtr dev, const char *driver) { - dev->stubDriver = driver; + VIR_FREE(dev->stubDriver); + return driver ? VIR_STRDUP(dev->stubDriver, driver) : 0; } const char * diff --git a/src/util/virpci.h b/src/util/virpci.h index 7bcadb4..17b15fe 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -62,8 +62,8 @@ int virPCIDeviceReset(virPCIDevicePtr dev, void virPCIDeviceSetManaged(virPCIDevice *dev, bool managed); unsigned int virPCIDeviceGetManaged(virPCIDevice *dev); -void virPCIDeviceSetStubDriver(virPCIDevicePtr dev, - const char *driver); +int virPCIDeviceSetStubDriver(virPCIDevicePtr dev, + const char *driver); const char *virPCIDeviceGetStubDriver(virPCIDevicePtr dev); void virPCIDeviceSetUsedBy(virPCIDevice *dev, const char *used_by); diff --git a/src/xen/xen_driver.c b/src/xen/xen_driver.c index 3efc27a..2506b8e 100644 --- a/src/xen/xen_driver.c +++ b/src/xen/xen_driver.c @@ -2237,7 +2237,8 @@ xenUnifiedNodeDeviceDetachFlags(virNodeDevicePtr dev, return -1; if (!driverName) { - virPCIDeviceSetStubDriver(pci, "pciback"); + if (virPCIDeviceSetStubDriver(pci, "pciback") < 0) + goto out; } else { virReportError(VIR_ERR_INVALID_ARG, _("unknown driver name '%s'"), driverName); -- 1.7.11.7

On Mon, Jun 24, 2013 at 05:54:51AM -0400, Laine Stump wrote:
Previously stubDriver was always set from a string literal, so it was okay to use a const char * that wasn't freed when the virPCIDevice was freed. This will not be the case in the near future, so it is now a char* that is allocated in virPCIDeviceSetStubDriver() and freed during virPCIDeviceFree(). --- src/qemu/qemu_driver.c | 6 ++++-- src/qemu/qemu_hostdev.c | 25 +++++++++++++++++-------- src/util/virpci.c | 8 +++++--- src/util/virpci.h | 4 ++-- src/xen/xen_driver.c | 3 ++- 5 files changed, 30 insertions(+), 16 deletions(-)
ACK It is bad practice to assume you can keep around a copy of any 'const char*' passed in as a parameter, since the const-ness of the parameter says nothing about whether the caller will free the parameter value. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 06/24/2013 10:55 AM, Daniel P. Berrange wrote:
On Mon, Jun 24, 2013 at 05:54:51AM -0400, Laine Stump wrote:
Previously stubDriver was always set from a string literal, so it was okay to use a const char * that wasn't freed when the virPCIDevice was freed. This will not be the case in the near future, so it is now a char* that is allocated in virPCIDeviceSetStubDriver() and freed during virPCIDeviceFree(). --- src/qemu/qemu_driver.c | 6 ++++-- src/qemu/qemu_hostdev.c | 25 +++++++++++++++++-------- src/util/virpci.c | 8 +++++--- src/util/virpci.h | 4 ++-- src/xen/xen_driver.c | 3 ++- 5 files changed, 30 insertions(+), 16 deletions(-) ACK
It is bad practice to assume you can keep around a copy of any 'const char*' passed in as a parameter, since the const-ness of the parameter says nothing about whether the caller will free the parameter value.
And this is just the tip of the iceberg with these device lists. While re-verifying the veracity of this fix, I ran across yet another case where virPCIDevices are simultaneously placed on 2 lists (through more use of this "list steal" function), and under certain circumstances would end up being leaked or (much worse) double freed (I think sometimes it's overlooked that virObjectUnref of a virPCIDeviceListPtr will free all the devices on the list, regardless of how many other places still point to them...) I'm going to try and unthread that and post a fix tonight. At any rate, this one is now pushed. Thanks!

* virPCIDeviceFindByIDs - find a device on a list w/o creating an object This makes searching for an existing device on a list lighter weight. * virPCIDeviceCopy - make a copy of an existing virPCIDevice object. * virPCIDeviceGetDriverPathAndName - construct new strings containing 1) the name of the driver bound to this device. 2) the full path to the sysfs config for that driver. (This code was lifted from virPCIDeviceUnbindFromStub, and replaced there with a call to this new function). --- src/libvirt_private.syms | 2 + src/util/virpci.c | 113 ++++++++++++++++++++++++++++++++++++++++------- src/util/virpci.h | 7 +++ 3 files changed, 105 insertions(+), 17 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 7ac6fdc..2fdb185 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1648,6 +1648,7 @@ virObjectUnref; # util/virpci.h virPCIDeviceAddressGetSysfsFile; +virPCIDeviceCopy; virPCIDeviceDetach; virPCIDeviceFileIterate; virPCIDeviceFree; @@ -1664,6 +1665,7 @@ virPCIDeviceListAdd; virPCIDeviceListCount; virPCIDeviceListDel; virPCIDeviceListFind; +virPCIDeviceListFindByIDs; virPCIDeviceListFindIndex; virPCIDeviceListGet; virPCIDeviceListNew; diff --git a/src/util/virpci.c b/src/util/virpci.c index d00c3ee..10e95bd 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -880,6 +880,54 @@ virPCIFile(char **buffer, const char *device, const char *file) return 0; } + +/* virPCIDeviceGetDriverPathAndName - put the path to the driver + * directory of the driver in use for this device in @path and the + * name of the driver in @name. Both could be NULL if it's not bound + * to any driver. + * + * Return 0 for success, -1 for error. + */ +static int +virPCIDeviceGetDriverPathAndName(virPCIDevicePtr dev, char **path, char **name) +{ + int ret = -1; + char *drvlink = NULL; + + *path = *name = NULL; + /* drvlink = "/sys/bus/pci/dddd:bb:ss.ff/driver" */ + if (virPCIFile(&drvlink, dev->name, "driver") < 0) + goto cleanup; + + if (virFileIsLink(drvlink) != 1) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid device %s driver file %s is not a symlink"), + dev->name, drvlink); + goto cleanup; + } + if (virFileResolveLink(drvlink, path) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to resolve device %s driver symlink %s"), + dev->name, drvlink); + goto cleanup; + } + /* path = "/sys/bus/pci/drivers/${drivername}" */ + + if (VIR_STRDUP(*name, last_component(*path)) < 0) + goto cleanup; + /* name = "${drivername}" */ + + ret = 0; +cleanup: + VIR_FREE(drvlink); + if (ret < 0) { + VIR_FREE(*path); + VIR_FREE(*name); + } + return ret; +} + + static int virPCIProbeStubDriver(const char *driver) { @@ -931,23 +979,7 @@ virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) /* If the device is currently bound to one of the "well known" * stub drivers, then unbind it, otherwise ignore it. */ - if (virPCIFile(&path, dev->name, "driver") < 0) - goto cleanup; - /* path = "/sys/bus/pci/dddd:bb:ss.ff/driver" */ - if (virFileIsLink(path) != 1) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Invalid device %s driver file %s is not a symlink"), - dev->name, path); - goto cleanup; - } - if (virFileResolveLink(path, &drvdir) < 0) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Unable to resolve device %s driver symlink %s"), - dev->name, path); - goto cleanup; - } - /* drvdir = "/sys/bus/pci/drivers/${drivername}" */ - if (VIR_STRDUP(driver, last_component(drvdir)) < 0) + if (virPCIDeviceGetDriverPathAndName(dev, &drvdir, &driver) < 0) goto cleanup; if (!dev->unbind_from_stub) @@ -1473,6 +1505,32 @@ error: goto cleanup; } + +virPCIDevicePtr +virPCIDeviceCopy(virPCIDevicePtr dev) +{ + virPCIDevicePtr copy; + + if (VIR_ALLOC(copy) < 0) { + virReportOOMError(); + return NULL; + } + + /* shallow copy to take care of most attributes */ + *copy = *dev; + copy->path = copy->stubDriver = NULL; + if (VIR_STRDUP(copy->path, dev->path) < 0 || + VIR_STRDUP(copy->stubDriver, dev->stubDriver) < 0) { + goto error; + } + return copy; + +error: + virPCIDeviceFree(copy); + return NULL; +} + + void virPCIDeviceFree(virPCIDevicePtr dev) { @@ -1690,6 +1748,27 @@ virPCIDeviceListFindIndex(virPCIDeviceListPtr list, virPCIDevicePtr dev) return -1; } + +virPCIDevicePtr +virPCIDeviceListFindByIDs(virPCIDeviceListPtr list, + unsigned int domain, + unsigned int bus, + unsigned int slot, + unsigned int function) +{ + int i; + + for (i = 0; i < list->count; i++) { + if (list->devs[i]->domain == domain && + list->devs[i]->bus == bus && + list->devs[i]->slot == slot && + list->devs[i]->function == function) + return list->devs[i]; + } + return NULL; +} + + virPCIDevicePtr virPCIDeviceListFind(virPCIDeviceListPtr list, virPCIDevicePtr dev) { diff --git a/src/util/virpci.h b/src/util/virpci.h index 17b15fe..d069adb 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -45,6 +45,7 @@ virPCIDevicePtr virPCIDeviceNew(unsigned int domain, unsigned int bus, unsigned int slot, unsigned int function); +virPCIDevicePtr virPCIDeviceCopy(virPCIDevicePtr dev); void virPCIDeviceFree(virPCIDevicePtr dev); const char *virPCIDeviceGetName(virPCIDevicePtr dev); @@ -94,6 +95,12 @@ void virPCIDeviceListDel(virPCIDeviceListPtr list, virPCIDevicePtr dev); virPCIDevicePtr virPCIDeviceListFind(virPCIDeviceListPtr list, virPCIDevicePtr dev); +virPCIDevicePtr +virPCIDeviceListFindByIDs(virPCIDeviceListPtr list, + unsigned int domain, + unsigned int bus, + unsigned int slot, + unsigned int function); int virPCIDeviceListFindIndex(virPCIDeviceListPtr list, virPCIDevicePtr dev); -- 1.7.11.7

On Mon, Jun 24, 2013 at 05:54:52AM -0400, Laine Stump wrote:
* virPCIDeviceFindByIDs - find a device on a list w/o creating an object This makes searching for an existing device on a list lighter weight.
* virPCIDeviceCopy - make a copy of an existing virPCIDevice object.
* virPCIDeviceGetDriverPathAndName - construct new strings containing 1) the name of the driver bound to this device. 2) the full path to the sysfs config for that driver. (This code was lifted from virPCIDeviceUnbindFromStub, and replaced there with a call to this new function). --- src/libvirt_private.syms | 2 + src/util/virpci.c | 113 ++++++++++++++++++++++++++++++++++++++++------- src/util/virpci.h | 7 +++ 3 files changed, 105 insertions(+), 17 deletions(-)
ACK Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 06/24/2013 10:58 AM, Daniel P. Berrange wrote:
On Mon, Jun 24, 2013 at 05:54:52AM -0400, Laine Stump wrote:
* virPCIDeviceFindByIDs - find a device on a list w/o creating an object This makes searching for an existing device on a list lighter weight.
* virPCIDeviceCopy - make a copy of an existing virPCIDevice object.
* virPCIDeviceGetDriverPathAndName - construct new strings containing 1) the name of the driver bound to this device. 2) the full path to the sysfs config for that driver. (This code was lifted from virPCIDeviceUnbindFromStub, and replaced there with a call to this new function). --- src/libvirt_private.syms | 2 + src/util/virpci.c | 113 ++++++++++++++++++++++++++++++++++++++++------- src/util/virpci.h | 7 +++ 3 files changed, 105 insertions(+), 17 deletions(-) ACK
Pushed. Thanks!

virPCIDeviceReattach was making the assumption that the dev object given to it was one and the same with the dev object on the inactiveDevs list. If that had been the case, it would not need to free the dev object it removed from the inactive list, because the caller of virPCIDeviceReattach always frees the dev object that it passes in. Since the dev object passed in is *never* the same object that's on the list (it is a different object with the same name and attributes, created just for the purpose of searching for the actual object), simply doing a "ListSteal" to remove the object from the list results in one leaked object; we need to actually free the object after removing it from the list. --- src/util/virpci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/util/virpci.c b/src/util/virpci.c index 10e95bd..1108ef2 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1266,7 +1266,7 @@ virPCIDeviceReattach(virPCIDevicePtr dev, /* Steal the dev from list inactiveDevs */ if (inactiveDevs) - virPCIDeviceListSteal(inactiveDevs, dev); + virPCIDeviceListDel(inactiveDevs, dev); return 0; } -- 1.7.11.7

On Mon, Jun 24, 2013 at 05:54:53AM -0400, Laine Stump wrote:
virPCIDeviceReattach was making the assumption that the dev object given to it was one and the same with the dev object on the inactiveDevs list. If that had been the case, it would not need to free the dev object it removed from the inactive list, because the caller of virPCIDeviceReattach always frees the dev object that it passes in. Since the dev object passed in is *never* the same object that's on the list (it is a different object with the same name and attributes, created just for the purpose of searching for the actual object), simply doing a "ListSteal" to remove the object from the list results in one leaked object; we need to actually free the object after removing it from the list. --- src/util/virpci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/util/virpci.c b/src/util/virpci.c index 10e95bd..1108ef2 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1266,7 +1266,7 @@ virPCIDeviceReattach(virPCIDevicePtr dev,
/* Steal the dev from list inactiveDevs */ if (inactiveDevs) - virPCIDeviceListSteal(inactiveDevs, dev); + virPCIDeviceListDel(inactiveDevs, dev);
return 0; }
ACK Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 06/24/2013 10:57 AM, Daniel P. Berrange wrote:
On Mon, Jun 24, 2013 at 05:54:53AM -0400, Laine Stump wrote:
virPCIDeviceReattach was making the assumption that the dev object given to it was one and the same with the dev object on the inactiveDevs list. If that had been the case, it would not need to free the dev object it removed from the inactive list, because the caller of virPCIDeviceReattach always frees the dev object that it passes in. Since the dev object passed in is *never* the same object that's on the list (it is a different object with the same name and attributes, created just for the purpose of searching for the actual object), simply doing a "ListSteal" to remove the object from the list results in one leaked object; we need to actually free the object after removing it from the list. --- src/util/virpci.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/util/virpci.c b/src/util/virpci.c index 10e95bd..1108ef2 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1266,7 +1266,7 @@ virPCIDeviceReattach(virPCIDevicePtr dev,
/* Steal the dev from list inactiveDevs */ if (inactiveDevs) - virPCIDeviceListSteal(inactiveDevs, dev); + virPCIDeviceListDel(inactiveDevs, dev);
return 0; } ACK
Pushed. Thanks!

virPCIDeviceDetach would previously sometimes consume the input device object (to put it on the inactive list) and sometimes not. Avoiding memory leaks required checking beforehand to see if the device was already on the list, and freeing the device object in the caller only if there wasn't already an identical object on the inactive list. This patch makes it consistent - virPCIDeviceDetach will *never* consume the input virPCIDevice object; if it needs to put one on the inactive list, it will create a copy and put *that* on the list. This way the caller knows that it is always their responsibility to free the device object they created. --- src/qemu/qemu_driver.c | 8 +++----- src/util/virpci.c | 26 ++++++++++++++++++++++++-- 2 files changed, 27 insertions(+), 7 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 6efec74..20edf45 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -10274,7 +10274,6 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, virPCIDevicePtr pci; unsigned domain, bus, slot, function; int ret = -1; - bool in_inactive_list = false; virCheckFlags(0, -1); @@ -10299,18 +10298,17 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, virObjectLock(driver->activePciHostdevs); virObjectLock(driver->inactivePciHostdevs); - in_inactive_list = virPCIDeviceListFind(driver->inactivePciHostdevs, pci); if (virPCIDeviceDetach(pci, driver->activePciHostdevs, - driver->inactivePciHostdevs, NULL) < 0) + driver->inactivePciHostdevs, NULL) < 0) { goto out; + } ret = 0; out: virObjectUnlock(driver->inactivePciHostdevs); virObjectUnlock(driver->activePciHostdevs); - if (in_inactive_list) - virPCIDeviceFree(pci); + virPCIDeviceFree(pci); return ret; } diff --git a/src/util/virpci.c b/src/util/virpci.c index 1108ef2..2f4032f 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1217,6 +1217,24 @@ cleanup: return result; } +/* virPCIDeviceDetach: + * + * Detach this device from the host driver, attach it to the stub + * driver (previously set with virPCIDeviceSetStubDriver(), and add *a + * copy* of the object to the inactiveDevs list (if provided). This + * function will *never* consume dev, so the caller should free it. + * + * Returns 0 on success, -1 on failure (will fail if the device is + * already in the activeDevs list, but will be a NOP if the device is + * already bound to the stub). + * + * GENERAL NOTE: activeDevs should be a list of all PCI devices + * currently in use by a domain. inactiveDevs is a list of all PCI + * devices that libvirt has detached from the host driver + attached + * to the stub driver, but hasn't yet assigned to a domain. Any device + * that is still attached to its host driver should not be on either + * list. + */ int virPCIDeviceDetach(virPCIDevicePtr dev, virPCIDeviceList *activeDevs, @@ -1241,9 +1259,13 @@ virPCIDeviceDetach(virPCIDevicePtr dev, if (virPCIDeviceBindToStub(dev, driver) < 0) return -1; - /* Add the dev into list inactiveDevs */ + /* Add *a copy of* the dev into list inactiveDevs, if + * it's not already there. + */ if (inactiveDevs && !virPCIDeviceListFind(inactiveDevs, dev)) { - if (virPCIDeviceListAdd(inactiveDevs, dev) < 0) + virPCIDevicePtr copy = virPCIDeviceCopy(dev); + + if ((!copy) || virPCIDeviceListAdd(inactiveDevs, copy) < 0) return -1; } -- 1.7.11.7

On Mon, Jun 24, 2013 at 05:54:54AM -0400, Laine Stump wrote:
virPCIDeviceDetach would previously sometimes consume the input device object (to put it on the inactive list) and sometimes not. Avoiding memory leaks required checking beforehand to see if the device was already on the list, and freeing the device object in the caller only if there wasn't already an identical object on the inactive list.
This patch makes it consistent - virPCIDeviceDetach will *never* consume the input virPCIDevice object; if it needs to put one on the inactive list, it will create a copy and put *that* on the list. This way the caller knows that it is always their responsibility to free the device object they created. --- src/qemu/qemu_driver.c | 8 +++----- src/util/virpci.c | 26 ++++++++++++++++++++++++-- 2 files changed, 27 insertions(+), 7 deletions(-)
ACK much safer semantics. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 06/24/2013 10:57 AM, Daniel P. Berrange wrote:
On Mon, Jun 24, 2013 at 05:54:54AM -0400, Laine Stump wrote:
virPCIDeviceDetach would previously sometimes consume the input device object (to put it on the inactive list) and sometimes not. Avoiding memory leaks required checking beforehand to see if the device was already on the list, and freeing the device object in the caller only if there wasn't already an identical object on the inactive list.
This patch makes it consistent - virPCIDeviceDetach will *never* consume the input virPCIDevice object; if it needs to put one on the inactive list, it will create a copy and put *that* on the list. This way the caller knows that it is always their responsibility to free the device object they created. --- src/qemu/qemu_driver.c | 8 +++----- src/util/virpci.c | 26 ++++++++++++++++++++++++-- 2 files changed, 27 insertions(+), 7 deletions(-) ACK much safer semantics.
I'm frankly surprised we don't have more crashes in this code. I've pushed this now.

The same strings were being re-created multiple times just to save declaring a new variable. In the meantime, the use of the generic variable names led to confusion when trying to follow the code. This patch creates strings for: stubDriverName (was called "driver" in original args) stubDriverPath ("/sys/bus/pci/drivers/${stubDriverName}") driverLink ("${device}/driver") oldDriverName (the final component of path linked to by "${device}/driver") oldDriverPath ("/sys/bus/pci/drivers/${oldDriverName}") then re-uses them as necessary. --- src/util/virpci.c | 73 ++++++++++++++++++++++++++++--------------------------- 1 file changed, 37 insertions(+), 36 deletions(-) diff --git a/src/util/virpci.c b/src/util/virpci.c index 2f4032f..0074af3 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1068,22 +1068,30 @@ cleanup: static int -virPCIDeviceBindToStub(virPCIDevicePtr dev, const char *driver) +virPCIDeviceBindToStub(virPCIDevicePtr dev, + const char *stubDriverName) { int result = -1; - char *drvdir = NULL; - char *path = NULL; int reprobe = false; - /* check whether the device is already bound to a driver */ - if (virPCIDriverDir(&drvdir, driver) < 0 || - virPCIFile(&path, dev->name, "driver") < 0) { + char *stubDriverPath = NULL; + char *driverLink = NULL; + char *oldDriverPath = NULL; + char *oldDriverName = NULL; + char *path = NULL; /* reused for different purposes */ + + if (virPCIDriverDir(&stubDriverPath, stubDriverName) < 0 || + virPCIFile(&driverLink, dev->name, "driver") < 0 || + virPCIDeviceGetDriverPathAndName(dev, &oldDriverPath, + &oldDriverName) < 0) { goto cleanup; } - if (virFileExists(path)) { - if (virFileLinkPointsTo(path, drvdir)) { - /* The device is already bound to pci-stub */ + if (virFileExists(driverLink)) { + if (virFileLinkPointsTo(driverLink, stubDriverPath)) { + /* The device is already bound to the correct driver */ + VIR_DEBUG("Device %s is already bound to %s", + dev->name, stubDriverName); result = 0; goto cleanup; } @@ -1098,26 +1106,21 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, const char *driver) * is triggered for such a device, it will also be immediately * bound by the stub. */ - if (virPCIDriverFile(&path, driver, "new_id") < 0) { + if (virPCIDriverFile(&path, stubDriverName, "new_id") < 0) { goto cleanup; } if (virFileWriteStr(path, dev->id, 0) < 0) { virReportSystemError(errno, _("Failed to add PCI device ID '%s' to %s"), - dev->id, driver); + dev->id, stubDriverName); goto cleanup; } /* check whether the device is bound to pci-stub when we write dev->id to - * new_id. + * ${stubDriver}/new_id. */ - if (virPCIDriverDir(&drvdir, driver) < 0 || - virPCIFile(&path, dev->name, "driver") < 0) { - goto remove_id; - } - - if (virFileLinkPointsTo(path, drvdir)) { + if (virFileLinkPointsTo(driverLink, stubDriverPath)) { dev->unbind_from_stub = true; dev->remove_slot = true; goto remove_id; @@ -1144,33 +1147,29 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, const char *driver) /* If the device isn't already bound to pci-stub, try binding it now. */ - if (virPCIDriverDir(&drvdir, driver) < 0 || - virPCIFile(&path, dev->name, "driver") < 0) { - goto remove_id; - } - - if (!virFileLinkPointsTo(path, drvdir)) { + if (!virFileLinkPointsTo(driverLink, stubDriverPath)) { /* Xen's pciback.ko wants you to use new_slot first */ - if (virPCIDriverFile(&path, driver, "new_slot") < 0) { + if (virPCIDriverFile(&path, stubDriverName, "new_slot") < 0) { goto remove_id; } if (virFileExists(path) && virFileWriteStr(path, dev->name, 0) < 0) { virReportSystemError(errno, - _("Failed to add slot for PCI device '%s' to %s"), - dev->name, driver); + _("Failed to add slot for " + "PCI device '%s' to %s"), + dev->name, stubDriverName); goto remove_id; } dev->remove_slot = true; - if (virPCIDriverFile(&path, driver, "bind") < 0) { + if (virPCIDriverFile(&path, stubDriverName, "bind") < 0) { goto remove_id; } if (virFileWriteStr(path, dev->name, 0) < 0) { virReportSystemError(errno, _("Failed to bind PCI device '%s' to %s"), - dev->name, driver); + dev->name, stubDriverName); goto remove_id; } dev->unbind_from_stub = true; @@ -1180,11 +1179,11 @@ remove_id: /* If 'remove_id' exists, remove the device id from pci-stub's dynamic * ID table so that 'drivers_probe' works below. */ - if (virPCIDriverFile(&path, driver, "remove_id") < 0) { + if (virPCIDriverFile(&path, stubDriverName, "remove_id") < 0) { /* We do not remove PCI ID from pci-stub, and we cannot reprobe it */ if (dev->reprobe) { VIR_WARN("Could not remove PCI ID '%s' from %s, and the device " - "cannot be probed again.", dev->id, driver); + "cannot be probed again.", dev->id, stubDriverName); } dev->reprobe = false; goto cleanup; @@ -1193,12 +1192,12 @@ remove_id: if (virFileExists(path) && virFileWriteStr(path, dev->id, 0) < 0) { virReportSystemError(errno, _("Failed to remove PCI ID '%s' from %s"), - dev->id, driver); + dev->id, stubDriverName); /* remove PCI ID from pci-stub failed, and we cannot reprobe it */ if (dev->reprobe) { VIR_WARN("Failed to remove PCI ID '%s' from %s, and the device " - "cannot be probed again.", dev->id, driver); + "cannot be probed again.", dev->id, stubDriverName); } dev->reprobe = false; goto cleanup; @@ -1207,12 +1206,14 @@ remove_id: result = 0; cleanup: - VIR_FREE(drvdir); + VIR_FREE(stubDriverPath); + VIR_FREE(driverLink); + VIR_FREE(oldDriverPath); + VIR_FREE(oldDriverName); VIR_FREE(path); - if (result < 0) { + if (result < 0) virPCIDeviceUnbindFromStub(dev); - } return result; } -- 1.7.11.7

The driver arg to virPCIDeviceDetach is no longer used (the name of the stub driver is now set in the virPCIDevice object, and virPCIDeviceDetach retrieves it from there). Remove it. --- src/qemu/qemu_driver.c | 2 +- src/qemu/qemu_hostdev.c | 2 +- src/util/virpci.c | 13 +++++-------- src/util/virpci.h | 3 +-- src/xen/xen_driver.c | 2 +- 5 files changed, 9 insertions(+), 13 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 20edf45..20b127c 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -10300,7 +10300,7 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, virObjectLock(driver->inactivePciHostdevs); if (virPCIDeviceDetach(pci, driver->activePciHostdevs, - driver->inactivePciHostdevs, NULL) < 0) { + driver->inactivePciHostdevs) < 0) { goto out; } diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index 7a9e6eb..dfe39c6 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -540,7 +540,7 @@ int qemuPrepareHostdevPCIDevices(virQEMUDriverPtr driver, for (i = 0; i < virPCIDeviceListCount(pcidevs); i++) { virPCIDevicePtr dev = virPCIDeviceListGet(pcidevs, i); if (virPCIDeviceGetManaged(dev) && - virPCIDeviceDetach(dev, driver->activePciHostdevs, NULL, NULL) < 0) + virPCIDeviceDetach(dev, driver->activePciHostdevs, NULL) < 0) goto reattachdevs; } diff --git a/src/util/virpci.c b/src/util/virpci.c index 0074af3..a53912d 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1239,15 +1239,12 @@ cleanup: int virPCIDeviceDetach(virPCIDevicePtr dev, virPCIDeviceList *activeDevs, - virPCIDeviceList *inactiveDevs, - const char *driver) + virPCIDeviceList *inactiveDevs) { - if (!driver && dev->stubDriver) - driver = dev->stubDriver; - - if (virPCIProbeStubDriver(driver) < 0) { + if (virPCIProbeStubDriver(dev->stubDriver) < 0) { virReportError(VIR_ERR_INTERNAL_ERROR, - _("Failed to load PCI stub module %s"), driver); + _("Failed to load PCI stub module %s"), + dev->stubDriver); return -1; } @@ -1257,7 +1254,7 @@ virPCIDeviceDetach(virPCIDevicePtr dev, return -1; } - if (virPCIDeviceBindToStub(dev, driver) < 0) + if (virPCIDeviceBindToStub(dev, dev->stubDriver) < 0) return -1; /* Add *a copy of* the dev into list inactiveDevs, if diff --git a/src/util/virpci.h b/src/util/virpci.h index d069adb..944aa09 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -51,8 +51,7 @@ const char *virPCIDeviceGetName(virPCIDevicePtr dev); int virPCIDeviceDetach(virPCIDevicePtr dev, virPCIDeviceListPtr activeDevs, - virPCIDeviceListPtr inactiveDevs, - const char *driver); + virPCIDeviceListPtr inactiveDevs); int virPCIDeviceReattach(virPCIDevicePtr dev, virPCIDeviceListPtr activeDevs, virPCIDeviceListPtr inactiveDevs); diff --git a/src/xen/xen_driver.c b/src/xen/xen_driver.c index 2506b8e..6724a36 100644 --- a/src/xen/xen_driver.c +++ b/src/xen/xen_driver.c @@ -2245,7 +2245,7 @@ xenUnifiedNodeDeviceDetachFlags(virNodeDevicePtr dev, goto out; } - if (virPCIDeviceDetach(pci, NULL, NULL, NULL) < 0) + if (virPCIDeviceDetach(pci, NULL, NULL) < 0) goto out; ret = 0; -- 1.7.11.7

If the device is bound to a stub driver different from what is saved in the virPCIDevice's stubDriver attribute, update it. --- src/util/virpci.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/src/util/virpci.c b/src/util/virpci.c index a53912d..2a77e77 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1079,6 +1079,7 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, char *oldDriverPath = NULL; char *oldDriverName = NULL; char *path = NULL; /* reused for different purposes */ + const char *newDriverName = NULL; if (virPCIDriverDir(&stubDriverPath, stubDriverName) < 0 || virPCIFile(&driverLink, dev->name, "driver") < 0 || @@ -1092,6 +1093,7 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, /* The device is already bound to the correct driver */ VIR_DEBUG("Device %s is already bound to %s", dev->name, stubDriverName); + newDriverName = stubDriverName; result = 0; goto cleanup; } @@ -1203,6 +1205,7 @@ remove_id: goto cleanup; } + newDriverName = stubDriverName; result = 0; cleanup: @@ -1212,6 +1215,11 @@ cleanup: VIR_FREE(oldDriverName); VIR_FREE(path); + if (newDriverName && + STRNEQ_NULLABLE(dev->stubDriver, newDriverName)) { + VIR_FREE(dev->stubDriver); + result = VIR_STRDUP(dev->stubDriver, newDriverName); + } if (result < 0) virPCIDeviceUnbindFromStub(dev); -- 1.7.11.7

I realized after the fact that it's probably better in the long run to give this function a name that matches the name of the link used in sysfs to hold the group (iommu_group). I'm changing it now because I'm about to add several more functions that deal with iommu groups. --- src/libvirt_private.syms | 2 +- src/qemu/qemu_cgroup.c | 4 ++-- src/security/security_apparmor.c | 2 +- src/security/security_dac.c | 4 ++-- src/security/security_selinux.c | 4 ++-- src/util/virpci.c | 6 +++--- src/util/virpci.h | 2 +- 7 files changed, 12 insertions(+), 12 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2fdb185..b7a59ec 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1652,6 +1652,7 @@ virPCIDeviceCopy; virPCIDeviceDetach; virPCIDeviceFileIterate; virPCIDeviceFree; +virPCIDeviceGetIOMMUGroupDev; virPCIDeviceGetManaged; virPCIDeviceGetName; virPCIDeviceGetRemoveSlot; @@ -1659,7 +1660,6 @@ virPCIDeviceGetReprobe; virPCIDeviceGetStubDriver; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; -virPCIDeviceGetVFIOGroupDev; virPCIDeviceIsAssignable; virPCIDeviceListAdd; virPCIDeviceListCount; diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index cf46993..5f54ca6 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -277,7 +277,7 @@ qemuSetupHostdevCGroup(virDomainObjPtr vm, if (!pci) goto cleanup; - if (!(path = virPCIDeviceGetVFIOGroupDev(pci))) + if (!(path = virPCIDeviceGetIOMMUGroupDev(pci))) goto cleanup; VIR_DEBUG("Cgroup allow %s for PCI device assignment", path); @@ -376,7 +376,7 @@ qemuTeardownHostdevCgroup(virDomainObjPtr vm, if (!pci) goto cleanup; - if (!(path = virPCIDeviceGetVFIOGroupDev(pci))) + if (!(path = virPCIDeviceGetIOMMUGroupDev(pci))) goto cleanup; VIR_DEBUG("Cgroup deny %s for PCI device assignment", path); diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c index 87c2777..50b0e74 100644 --- a/src/security/security_apparmor.c +++ b/src/security/security_apparmor.c @@ -823,7 +823,7 @@ AppArmorSetSecurityHostdevLabel(virSecurityManagerPtr mgr, if (dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); if (!vfioGroupDev) { virPCIDeviceFree(pci); diff --git a/src/security/security_dac.c b/src/security/security_dac.c index b8d1a92..0d6defc 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -529,7 +529,7 @@ virSecurityDACSetSecurityHostdevLabel(virSecurityManagerPtr mgr, if (dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); if (!vfioGroupDev) { virPCIDeviceFree(pci); @@ -648,7 +648,7 @@ virSecurityDACRestoreSecurityHostdevLabel(virSecurityManagerPtr mgr, if (dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); if (!vfioGroupDev) { virPCIDeviceFree(pci); diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index 6fe063e..7802dda 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -1333,7 +1333,7 @@ virSecuritySELinuxSetSecurityHostdevSubsysLabel(virDomainDefPtr def, if (dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); if (!vfioGroupDev) { virPCIDeviceFree(pci); @@ -1528,7 +1528,7 @@ virSecuritySELinuxRestoreSecurityHostdevSubsysLabel(virSecurityManagerPtr mgr, if (dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { - char *vfioGroupDev = virPCIDeviceGetVFIOGroupDev(pci); + char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); if (!vfioGroupDev) { virPCIDeviceFree(pci); diff --git a/src/util/virpci.c b/src/util/virpci.c index 2a77e77..23cff09 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1861,11 +1861,11 @@ cleanup: return ret; } -/* virPCIDeviceGetVFIOGroupDev - return the name of the device used to - * control this PCI device's group (e.g. "/dev/vfio/15") +/* virPCIDeviceGetIOMMUGroupDev - return the name of the device used + * to control this PCI device's group (e.g. "/dev/vfio/15") */ char * -virPCIDeviceGetVFIOGroupDev(virPCIDevicePtr dev) +virPCIDeviceGetIOMMUGroupDev(virPCIDevicePtr dev) { char *devPath = NULL; char *groupPath = NULL; diff --git a/src/util/virpci.h b/src/util/virpci.h index 944aa09..89717b8 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -117,7 +117,7 @@ int virPCIDeviceFileIterate(virPCIDevicePtr dev, virPCIDeviceFileActor actor, void *opaque); char * -virPCIDeviceGetVFIOGroupDev(virPCIDevicePtr dev); +virPCIDeviceGetIOMMUGroupDev(virPCIDevicePtr dev); int virPCIDeviceIsAssignable(virPCIDevicePtr dev, int strict_acs_check); -- 1.7.11.7

This function has utility outside of virpci.c, so make it public. --- src/libvirt_private.syms | 1 + src/util/virpci.c | 2 +- src/util/virpci.h | 2 ++ 3 files changed, 4 insertions(+), 1 deletion(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index b7a59ec..2f8f202 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1688,6 +1688,7 @@ virPCIGetVirtualFunctionIndex; virPCIGetVirtualFunctionInfo; virPCIGetVirtualFunctions; virPCIIsVirtualFunction; +virPCIParseDeviceAddress; # util/virpidfile.h diff --git a/src/util/virpci.c b/src/util/virpci.c index 23cff09..c34f872 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2055,7 +2055,7 @@ logStrToLong_ui(char const *s, return ret; } -static int +int virPCIParseDeviceAddress(char *address, virPCIDeviceAddressPtr bdf) { diff --git a/src/util/virpci.h b/src/util/virpci.h index 89717b8..bcf1b81 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -152,6 +152,8 @@ int virPCIGetAddrString(unsigned int domain, char **pciConfigAddr) ATTRIBUTE_NONNULL(5) ATTRIBUTE_RETURN_CHECK; +int virPCIParseDeviceAddress(char *address, virPCIDeviceAddressPtr bdf); + int virPCIGetVirtualFunctionInfo(const char *vf_sysfs_device_path, char **pfname, int *vf_index); -- 1.7.11.7

Any device which belongs to an "IOMMU group" (used by vfio) will have links to all devices of its group listed in /sys/bus/pci/$device/iommu_group/devices; /sys/bus/pci/$device/iommu_group is actually a link to /sys/kernel/iommu_groups/$n, where $n is the group number (there will be a corresponding device node at /dev/vfio/$n once the devices are bound to the vfio-pci driver) The following functions are added: virPCIDeviceGetIOMMUGroupList Gets a virPCIDeviceList with one virPCIDeviceList for each device in the same IOMMU group as the provided virPCIDevice (a copy of the original device object is included in the list. virPCIIOMMUGroupIterate Calls the function @actor once for each device in the group that contains the given virPCIDeviceAddress. virPCIGetIOMMUGroupAddresses Fills in a virPCIDeviceAddressPtr * with an array of virPCIDeviceAddress, one for each device in the iommu group of the provided virPCIDeviceAddress (including a copy of the original). virPCIGetIOMMUGroupNum Returns the group number as an int (a valid group number will always be 0 or greater). If there is no iommu_group link in the device's directory (usually indicating that vfio isn't loaded), -2 will be returned. On any real error, -1 will be returned. --- src/libvirt_private.syms | 4 + src/util/virpci.c | 217 +++++++++++++++++++++++++++++++++++++++++++++++ src/util/virpci.h | 15 +++- 3 files changed, 233 insertions(+), 3 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2f8f202..dd3a138 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1653,6 +1653,7 @@ virPCIDeviceDetach; virPCIDeviceFileIterate; virPCIDeviceFree; virPCIDeviceGetIOMMUGroupDev; +virPCIDeviceGetIOMMUGroupList; virPCIDeviceGetManaged; virPCIDeviceGetName; virPCIDeviceGetRemoveSlot; @@ -1682,11 +1683,14 @@ virPCIDeviceSetStubDriver; virPCIDeviceSetUnbindFromStub; virPCIDeviceSetUsedBy; virPCIDeviceWaitForCleanup; +virPCIGetIOMMUGroupAddresses; +virPCIGetIOMMUGroupNum; virPCIGetNetName; virPCIGetPhysicalFunction; virPCIGetVirtualFunctionIndex; virPCIGetVirtualFunctionInfo; virPCIGetVirtualFunctions; +virPCIIOMMUGroupIterate; virPCIIsVirtualFunction; virPCIParseDeviceAddress; diff --git a/src/util/virpci.c b/src/util/virpci.c index c34f872..cbade1b 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -1861,6 +1861,223 @@ cleanup: return ret; } + +/* virPCIIOMMUGroupIterate: + * Call @actor for all devices in the same iommu_group as orig + * (including orig itself) Even if there is no iommu_group for the + * device, call @actor once for orig. + */ +int +virPCIIOMMUGroupIterate(virPCIDeviceAddressPtr orig, + virPCIDeviceAddressActor actor, + void *opaque) +{ + char *groupPath = NULL; + DIR *groupDir = NULL; + int ret = -1; + struct dirent *ent; + + if (virAsprintf(&groupPath, + PCI_SYSFS "devices/%04x:%02x:%02x.%x/iommu_group/devices", + orig->domain, orig->bus, orig->slot, orig->function) < 0) { + virReportOOMError(); + goto cleanup; + } + + if (!(groupDir = opendir(groupPath))) { + /* just process the original device, nothing more */ + ret = (actor)(orig, opaque); + goto cleanup; + } + + while ((ent = readdir(groupDir)) != NULL) { + virPCIDeviceAddress newDev; + + if (ent->d_name[0] == '.') + continue; + + if (virPCIParseDeviceAddress(ent->d_name, &newDev) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Found invalid device link '%s' in '%s'"), + ent->d_name, groupPath); + goto cleanup; + } + + if ((actor)(&newDev, opaque) < 0) + goto cleanup; + } + + ret = 0; + +cleanup: + VIR_FREE(groupPath); + if (groupDir) + closedir(groupDir); + return ret; +} + + +static int +virPCIDeviceGetIOMMUGroupAddOne(virPCIDeviceAddressPtr newDevAddr, void *opaque) +{ + int ret = -1; + virPCIDeviceListPtr groupList = opaque; + virPCIDevicePtr newDev; + + if (!(newDev = virPCIDeviceNew(newDevAddr->domain, newDevAddr->bus, + newDevAddr->slot, newDevAddr->function))) { + goto cleanup; + } + + if (virPCIDeviceListAdd(groupList, newDev) < 0) { + goto cleanup; + } + newDev = NULL; /* it's now on the list */ + ret = 0; +cleanup: + virPCIDeviceFree(newDev); + return ret; +} + + +/* + * virPCIDeviceGetIOMMUGroupList - return a virPCIDeviceList containing + * all of the devices in the same iommu_group as @dev. + * + * Return the new list, or NULL on failure + */ +virPCIDeviceListPtr +virPCIDeviceGetIOMMUGroupList(virPCIDevicePtr dev) +{ + virPCIDeviceListPtr groupList = virPCIDeviceListNew(); + virPCIDeviceAddress devAddr = { dev->domain, dev->bus, + dev->slot, dev->function }; + + if (!groupList) + goto error; + + if (virPCIIOMMUGroupIterate(&devAddr, virPCIDeviceGetIOMMUGroupAddOne, + groupList) < 0) { + goto error; + } + return groupList; + +error: + virObjectUnref(groupList); + return NULL; +} + + +typedef struct { + virPCIDeviceAddressPtr **iommuGroupDevices; + size_t *nIommuGroupDevices; +} virPCIDeviceAddressList; +typedef virPCIDeviceAddressList *virPCIDeviceAddressListPtr; + +static int +virPCIGetIOMMUGroupAddressesAddOne(virPCIDeviceAddressPtr newDevAddr, void *opaque) +{ + int ret = -1; + virPCIDeviceAddressListPtr addrList = opaque; + virPCIDeviceAddressPtr copyAddr; + + /* make a copy to insert onto the list */ + if (VIR_ALLOC(copyAddr) < 0) + goto cleanup; + + *copyAddr = *newDevAddr; + + if (VIR_APPEND_ELEMENT(*addrList->iommuGroupDevices, + *addrList->nIommuGroupDevices, copyAddr) < 0) { + goto cleanup; + } + + ret = 0; +cleanup: + VIR_FREE(copyAddr); + return ret; +} + + +/* + * virPCIGetIOMMUGroupAddresses - return a virPCIDeviceList containing + * all of the devices in the same iommu_group as @dev. + * + * Return the new list, or NULL on failure + */ +int +virPCIGetIOMMUGroupAddresses(virPCIDeviceAddressPtr devAddr, + virPCIDeviceAddressPtr **iommuGroupDevices, + size_t *nIommuGroupDevices) +{ + int ret = -1; + virPCIDeviceAddressList addrList = { iommuGroupDevices, + nIommuGroupDevices }; + + if (virPCIIOMMUGroupIterate(devAddr, + virPCIGetIOMMUGroupAddressesAddOne, + &addrList) < 0) { + goto cleanup; + } + + ret = 0; +cleanup: + return ret; +} + + +/* virPCIGetIOMMUGroupNum - return the group number of this PCI + * device's iommu_group, or -2 if there is no iommu_group for the + * device (or -1 if there was any other error) + */ +int +virPCIGetIOMMUGroupNum(virPCIDeviceAddressPtr addr) +{ + char *devName = NULL; + char *devPath = NULL; + char *groupPath = NULL; + const char *groupNumStr; + unsigned int groupNum; + int ret = -1; + + if (virAsprintf(&devName, "%.4x:%.2x:%.2x.%.1x", addr->domain, + addr->bus, addr->slot, addr->function) < 0) { + virReportOOMError(); + goto cleanup; + } + + if (virPCIFile(&devPath, devName, "iommu_group") < 0) + goto cleanup; + if (virFileIsLink(devPath) != 1) { + ret = -2; + goto cleanup; + } + if (virFileResolveLink(devPath, &groupPath) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to resolve device %s iommu_group symlink %s"), + devName, devPath); + goto cleanup; + } + + groupNumStr = last_component(groupPath); + if (virStrToLong_ui(groupNumStr, NULL, 10, &groupNum) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("device %s iommu_group symlink %s has " + "invalid group number %s"), + devName, groupPath, groupNumStr); + ret = -1; + goto cleanup; + } + + ret = groupNum; +cleanup: + VIR_FREE(devName); + VIR_FREE(devPath); + VIR_FREE(groupPath); + return ret; +} + + /* virPCIDeviceGetIOMMUGroupDev - return the name of the device used * to control this PCI device's group (e.g. "/dev/vfio/15") */ diff --git a/src/util/virpci.h b/src/util/virpci.h index bcf1b81..972f86b 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -112,12 +112,21 @@ int virPCIDeviceListFindIndex(virPCIDeviceListPtr list, */ typedef int (*virPCIDeviceFileActor)(virPCIDevicePtr dev, const char *path, void *opaque); - int virPCIDeviceFileIterate(virPCIDevicePtr dev, virPCIDeviceFileActor actor, void *opaque); -char * -virPCIDeviceGetIOMMUGroupDev(virPCIDevicePtr dev); + +typedef int (*virPCIDeviceAddressActor)(virPCIDeviceAddress *addr, + void *opaque); +int virPCIIOMMUGroupIterate(virPCIDeviceAddressPtr orig, + virPCIDeviceAddressActor actor, + void *opaque); +virPCIDeviceListPtr virPCIDeviceGetIOMMUGroupList(virPCIDevicePtr dev); +int virPCIGetIOMMUGroupAddresses(virPCIDeviceAddressPtr devAddr, + virPCIDeviceAddressPtr **iommuGroupDevices, + size_t *nIommuGroupDevices); +int virPCIGetIOMMUGroupNum(virPCIDeviceAddressPtr dev); +char *virPCIDeviceGetIOMMUGroupDev(virPCIDevicePtr dev); int virPCIDeviceIsAssignable(virPCIDevicePtr dev, int strict_acs_check); -- 1.7.11.7

It will sometimes be desirable for a "managed" hostdev that is in a VFIO group with several other devices to cause *all* of the devices in the group to be detached/reattached. This patch gives that ability to virPCIDeviceDetach and virPCIDeviceReattach. To cause this, you would need to have previously called the new virPCIDeviceSetAttachGroup() function, which isn't yet called by anyone. (The "attach_group" setting must be set in the virPCIDevice object rather than passed as an argument to virPCIDevice(De|Reat)tach() because there is at least one case where the necessary information is available when the virPCIDevice is created, but is *not* available at a later time when the attach function is called.) NB: the call to virPCIDeviceReattachInit() was moved from qemu_driver.c:qemuNodeDeviceReAttach() to virpci.c:virPCIDeviceReattach() because it must be done for each device that is being re-attached, and the qemu function only knows about a single device. It also appears that moving the call to virPCIDeviceReattachInit() coincidentally fixes a bug for the xen driver, since no reattach would ever take place without the setting of the flags done by virPCIDeviceReattachInit(), and the xen driver's implementation of xenNodeDeviceReAttach() was never calling it - since it is now called by virPCIDeviceReattach(), re-attach should work properly for the xen driver. --- src/libvirt_private.syms | 2 + src/qemu/qemu_driver.c | 4 +- src/qemu/qemu_hostdev.c | 4 +- src/util/virpci.c | 254 ++++++++++++++++++++++++++++++++++++++++++----- src/util/virpci.h | 6 +- src/xen/xen_driver.c | 2 +- 6 files changed, 241 insertions(+), 31 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index dd3a138..9a877ff 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1652,6 +1652,7 @@ virPCIDeviceCopy; virPCIDeviceDetach; virPCIDeviceFileIterate; virPCIDeviceFree; +virPCIDeviceGetAttachGroup; virPCIDeviceGetIOMMUGroupDev; virPCIDeviceGetIOMMUGroupList; virPCIDeviceGetManaged; @@ -1676,6 +1677,7 @@ virPCIDeviceNew; virPCIDeviceReattach; virPCIDeviceReattachInit; virPCIDeviceReset; +virPCIDeviceSetAttachGroup; virPCIDeviceSetManaged; virPCIDeviceSetRemoveSlot; virPCIDeviceSetReprobe; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 20b127c..8f498e6 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -10351,10 +10351,8 @@ qemuNodeDeviceReAttach(virNodeDevicePtr dev) goto out; } - virPCIDeviceReattachInit(pci); - if (virPCIDeviceReattach(pci, driver->activePciHostdevs, - driver->inactivePciHostdevs) < 0) + driver->inactivePciHostdevs, true) < 0) goto out; ret = 0; diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index dfe39c6..1abad59 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -661,7 +661,7 @@ reattachdevs: /* NB: This doesn't actually re-bind to original driver, just * unbinds from the stub driver */ - virPCIDeviceReattach(dev, driver->activePciHostdevs, NULL); + virPCIDeviceReattach(dev, driver->activePciHostdevs, NULL, false); } cleanup: @@ -1042,7 +1042,7 @@ void qemuReattachPciDevice(virPCIDevicePtr dev, virQEMUDriverPtr driver) } if (virPCIDeviceReattach(dev, driver->activePciHostdevs, - driver->inactivePciHostdevs) < 0) { + driver->inactivePciHostdevs, false) < 0) { virErrorPtr err = virGetLastError(); VIR_ERROR(_("Failed to re-attach PCI device: %s"), err ? err->message : _("unknown error")); diff --git a/src/util/virpci.c b/src/util/virpci.c index cbade1b..80256f9 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -39,6 +39,7 @@ #include "dirname.h" #include "virlog.h" #include "viralloc.h" +#include "virbitmap.h" #include "vircommand.h" #include "virerror.h" #include "virfile.h" @@ -70,6 +71,7 @@ struct _virPCIDevice { bool has_pm_reset; bool managed; char *stubDriver; + bool attach_group; /* true if okay to attach/detach entire group */ /* used by reattach function */ bool unbind_from_stub; @@ -967,7 +969,7 @@ static const char *virPCIKnownStubs[] = { }; static int -virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) +virPCIDeviceUnbindFromStub(virPCIDevicePtr dev, const char *stubDriver) { int result = -1; char *drvdir = NULL; @@ -985,6 +987,13 @@ virPCIDeviceUnbindFromStub(virPCIDevicePtr dev) if (!dev->unbind_from_stub) goto remove_slot; + /* if the caller only wants us to unbind if the device is + * currently bound to one specific stub driver, that driver name + * will be in @stubDriver. + */ + if (stubDriver && STRNEQ(driver, stubDriver)) + goto remove_slot; + /* If the device isn't bound to a known stub, skip the unbind. */ for (stubTest = virPCIKnownStubs; *stubTest != NULL; stubTest++) { if (STREQ(driver, *stubTest)) { @@ -1067,9 +1076,19 @@ cleanup: } +/* these drivers are acceptable substitutes for vfio-pci + * in virPCIDeviceBindToStub(), unless "force" is true. + */ +static const char *virPCIVFIOSubstitutes[] = { + "pci-stub", + "pcieport", + NULL +}; + static int virPCIDeviceBindToStub(virPCIDevicePtr dev, - const char *stubDriverName) + const char *stubDriverName, + bool force, bool *bound) { int result = -1; int reprobe = false; @@ -1081,6 +1100,9 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, char *path = NULL; /* reused for different purposes */ const char *newDriverName = NULL; + /* this will be set if we change the binding */ + *bound = false; + if (virPCIDriverDir(&stubDriverPath, stubDriverName) < 0 || virPCIFile(&driverLink, dev->name, "driver") < 0 || virPCIDeviceGetDriverPathAndName(dev, &oldDriverPath, @@ -1088,6 +1110,7 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, goto cleanup; } + /* check whether the device is already bound to an acceptable driver */ if (virFileExists(driverLink)) { if (virFileLinkPointsTo(driverLink, stubDriverPath)) { /* The device is already bound to the correct driver */ @@ -1097,6 +1120,28 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, result = 0; goto cleanup; } + /* If force == false, this device is just a secondary driver + * in the same VFIO group as the primary driver we want bound + * to the stub (rather than the primary device that is the + * real target of the bind). In that case, it's okay for it to + * instead be bound to one of the acceptable "substitute" + * drivers (which are deemed by VFIO to be safe for isolation, + * and are on an internal "whitelist" in the vfio driver). + */ + if (!force && oldDriverName && STREQ(stubDriverName, "vfio-pci")) { + const char **stubTest; + for (stubTest = virPCIVFIOSubstitutes; + *stubTest != NULL; stubTest++) { + if (STREQ(oldDriverName, *stubTest)) { + VIR_DEBUG("Device %s is bound to acceptable " + "substitute driver %s", + dev->name, *stubTest); + newDriverName = *stubTest; + result = 0; + goto cleanup; + } + } + } reprobe = true; } @@ -1119,6 +1164,8 @@ virPCIDeviceBindToStub(virPCIDevicePtr dev, goto cleanup; } + *bound = true; + /* check whether the device is bound to pci-stub when we write dev->id to * ${stubDriver}/new_id. */ @@ -1221,7 +1268,7 @@ cleanup: result = VIR_STRDUP(dev->stubDriver, newDriverName); } if (result < 0) - virPCIDeviceUnbindFromStub(dev); + virPCIDeviceUnbindFromStub(dev, stubDriverName); return result; } @@ -1249,54 +1296,201 @@ virPCIDeviceDetach(virPCIDevicePtr dev, virPCIDeviceList *activeDevs, virPCIDeviceList *inactiveDevs) { + /* grouplist contains all devices in the group, even dev itself. */ + virPCIDeviceList *groupList = NULL; + virPCIDevicePtr current, copy = NULL; + virBitmapPtr newlyDetached = NULL; + bool bound; + int ii, ret = -1; + if (virPCIProbeStubDriver(dev->stubDriver) < 0) { virReportError(VIR_ERR_INTERNAL_ERROR, _("Failed to load PCI stub module %s"), dev->stubDriver); - return -1; + goto cleanup; } if (activeDevs && virPCIDeviceListFind(activeDevs, dev)) { - virReportError(VIR_ERR_INTERNAL_ERROR, + virReportError(VIR_ERR_OPERATION_INVALID, _("Not detaching active device %s"), dev->name); - return -1; + goto cleanup; } - if (virPCIDeviceBindToStub(dev, dev->stubDriver) < 0) - return -1; + if (dev->attach_group && STREQ(dev->stubDriver, "vfio-pci")) { + /* Get a list of all devices in the same vfio group as this one */ + if (!(groupList = virPCIDeviceGetIOMMUGroupList(dev))) + goto cleanup; + } else { + /* simple case - just detach this one device */ + if (!(groupList = virPCIDeviceListNew()) || + !(copy = virPCIDeviceCopy(dev)) || + virPCIDeviceListAdd(groupList, copy) < 0) { + goto cleanup; + } + copy = NULL; + } - /* Add *a copy of* the dev into list inactiveDevs, if - * it's not already there. + if (!(newlyDetached = virBitmapNew(virPCIDeviceListCount(groupList)))) + goto cleanup; + + /* * groupList contains a copy of all devices that should be + * detached (bound to the stub) + * + * * dev is the one that *must* be detached. + * + * * detachedDevs is a bitmap that has the bit set for each item + * in groupList that has been newly detached (and will need to + * be reattach in case of failure) */ - if (inactiveDevs && !virPCIDeviceListFind(inactiveDevs, dev)) { - virPCIDevicePtr copy = virPCIDeviceCopy(dev); - if ((!copy) || virPCIDeviceListAdd(inactiveDevs, copy) < 0) - return -1; + for (ii = 0; (current = virPCIDeviceListGet(groupList, ii)); ii++) { + /* If any device other than the "primary" (the one originally + * requested) is on the active list (already assigned to a + * domain), just skip detach - we can't tell right now if it's + * in use by the correct domain, but that will all come out + * when we get to the point of doing the assign. (NB: We've + * already determined that the primary *isn't* on the active + * list) + */ + if (virPCIDeviceListFind(activeDevs, current)) + continue; + + /* bind to stub (force if this is the original device) */ + if (virPCIDeviceBindToStub(current, dev->stubDriver, + STREQ(current->name, dev->name), + &bound) < 0) { + goto cleanup; + } + + /* remember if we detached it */ + if (bound) + ignore_value(virBitmapSetBit(newlyDetached, ii)); + + /* add to inactive */ + if (inactiveDevs && !virPCIDeviceListFind(inactiveDevs, current) && + (!(copy = virPCIDeviceCopy(current)) || + virPCIDeviceListAdd(inactiveDevs, copy) < 0)) { + goto cleanup; + } + copy = NULL; /* it's on the list now */ } - return 0; + ret = 0; +cleanup: + + if (ret < 0) { + /* try to cleanup what was halfway done if we fail */ + if (groupList && newlyDetached) { + for (ii = 0; (current = virPCIDeviceListGet(groupList, ii)); ii++) { + ignore_value(virBitmapGetBit(newlyDetached, ii, &bound)); + if (bound) + virPCIDeviceUnbindFromStub(current, dev->stubDriver); + } + } + } + + virObjectUnref(groupList); + virPCIDeviceFree(copy); + virBitmapFree(newlyDetached); + return ret; } +/* virPCIDeviceReattach: + * + * Re-attach this device (and all devices in the same group, if + * dev->attach_group is true) to its host driver. + * + * if @force is true, return failure if the given device couldn't be + * re-attached due to another device in the group being buse. If + * @force is false, ignore this circumstance (i.e., don't re-attach + * the device to the host, but return success). + * + * Return 0 on success, -1 on failure. + */ int virPCIDeviceReattach(virPCIDevicePtr dev, virPCIDeviceListPtr activeDevs, - virPCIDeviceListPtr inactiveDevs) + virPCIDeviceListPtr inactiveDevs, + bool force) { + virPCIDeviceList *groupList = NULL; + virPCIDevicePtr current, copy = NULL; + char *drvPath = NULL, *drvName = NULL; + int ii, ret = -1; + if (activeDevs && virPCIDeviceListFind(activeDevs, dev)) { - virReportError(VIR_ERR_INTERNAL_ERROR, + virReportError(VIR_ERR_OPERATION_INVALID, _("Not reattaching active device %s"), dev->name); - return -1; + goto cleanup; } - if (virPCIDeviceUnbindFromStub(dev) < 0) - return -1; + if (virPCIDeviceGetDriverPathAndName(dev, &drvPath, &drvName) < 0) + goto cleanup; - /* Steal the dev from list inactiveDevs */ - if (inactiveDevs) - virPCIDeviceListDel(inactiveDevs, dev); + if (dev->attach_group && STREQ_NULLABLE(drvName, "vfio-pci")) { + /* Get a list of all devices in the same vfio group as this one */ + if (!(groupList = virPCIDeviceGetIOMMUGroupList(dev))) + goto cleanup; + } else { + /* simple case - just reattach this one device */ + if (!(groupList = virPCIDeviceListNew()) || + !(copy = virPCIDeviceCopy(dev)) || + virPCIDeviceListAdd(groupList, copy) < 0) { + goto cleanup; + } + copy = NULL; - return 0; + if (activeDevs) { + /* If any device in the list is still in use by a + * guest, don't reattach any of them to the host. + */ + for (ii = 0; (current = virPCIDeviceListGet(groupList, ii)); ii++) { + if (virPCIDeviceListFind(activeDevs, current)) { + if (force) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("Not reattaching device %s " + "because it is in the same group " + "as %s which is still in use by a " + "domain."), + dev->name, current->name); + } else { + /* if force is false, this isn't an error */ + VIR_DEBUG("Not reattaching device %s " + "because it is in the same group " + "as %s which is still in use by a domain.", + dev->name, current->name); + ret = 0; + } + goto cleanup; + } + } + } + } + + /* if we get this far, all devices in this group are free to be + * reattached to the host driver + */ + for (ii = 0; (current = virPCIDeviceListGet(groupList, ii)); ii++) { + /* reattach to the host *IFF* this device is currently bound + * to exactly the same stub driver as the requested device is + * bound to. */ + + virPCIDeviceReattachInit(current); /* won't reattach w/o these set */ + if (virPCIDeviceUnbindFromStub(current, drvName) < 0) + goto cleanup; + /* Remove reattached device from inactiveDevs */ + if (inactiveDevs) + virPCIDeviceListDel(inactiveDevs, current); + } + + ret = 0; +cleanup: + + VIR_FREE(drvPath); + VIR_FREE(drvName); + virObjectUnref(groupList); + virPCIDeviceFree(copy); + return ret; } /* Certain hypervisors (like qemu/kvm) map the PCI bar(s) on @@ -1600,6 +1794,18 @@ virPCIDeviceGetStubDriver(virPCIDevicePtr dev) return dev->stubDriver; } +void +virPCIDeviceSetAttachGroup(virPCIDevicePtr dev, bool allow) +{ + dev->attach_group = allow; +} + +bool +virPCIDeviceGetAttachGroup(virPCIDevicePtr dev) +{ + return dev->attach_group; +} + unsigned int virPCIDeviceGetUnbindFromStub(virPCIDevicePtr dev) { diff --git a/src/util/virpci.h b/src/util/virpci.h index 972f86b..c598456 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -54,7 +54,8 @@ int virPCIDeviceDetach(virPCIDevicePtr dev, virPCIDeviceListPtr inactiveDevs); int virPCIDeviceReattach(virPCIDevicePtr dev, virPCIDeviceListPtr activeDevs, - virPCIDeviceListPtr inactiveDevs); + virPCIDeviceListPtr inactiveDevs, + bool force); int virPCIDeviceReset(virPCIDevicePtr dev, virPCIDeviceListPtr activeDevs, virPCIDeviceListPtr inactiveDevs); @@ -65,6 +66,9 @@ unsigned int virPCIDeviceGetManaged(virPCIDevice *dev); int virPCIDeviceSetStubDriver(virPCIDevicePtr dev, const char *driver); const char *virPCIDeviceGetStubDriver(virPCIDevicePtr dev); +void virPCIDeviceSetAttachGroup(virPCIDevicePtr dev, + bool allow); +bool virPCIDeviceGetAttachGroup(virPCIDevicePtr dev); void virPCIDeviceSetUsedBy(virPCIDevice *dev, const char *used_by); const char *virPCIDeviceGetUsedBy(virPCIDevice *dev); diff --git a/src/xen/xen_driver.c b/src/xen/xen_driver.c index 6724a36..c9e4db8 100644 --- a/src/xen/xen_driver.c +++ b/src/xen/xen_driver.c @@ -2341,7 +2341,7 @@ xenUnifiedNodeDeviceReAttach(virNodeDevicePtr dev) goto out; } - if (virPCIDeviceReattach(pci, NULL, NULL) < 0) + if (virPCIDeviceReattach(pci, NULL, NULL, false) < 0) goto out; ret = 0; -- 1.7.11.7

This provides a frontend for the virpci internal "detach_group" flag. When virNodeDeviceDetachFlags is given the flag VIR_NODE_DEVICE_DETACH_GROUP (and the specified driver is "vfio"), all devices in the same VFIO group with the specified device will be detached from the host and bound to the vfio-pci driver (unless they have already been bound to either the pci-stub driver or pcieport driver, both of which are acceptable substitutes in the eyes of VFIO). --- include/libvirt/libvirt.h.in | 10 ++++++++++ src/qemu/qemu_driver.c | 5 ++++- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index acf3218..ba471b2 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -3291,6 +3291,16 @@ char * virNodeDeviceGetXMLDesc (virNodeDevicePtr dev, int virNodeDeviceRef (virNodeDevicePtr dev); int virNodeDeviceFree (virNodeDevicePtr dev); +/* + * virNodeDeviceDetachFlags: + * + * Flags that can be sent to the virNodeDeviceDetachFlags() API. + * + */ +typedef enum { + VIR_NODE_DEVICE_DETACH_GROUP = 1 << 0, /* detach all devices in same group */ +} virNodeDeviceDetachFlagsType; + int virNodeDeviceDettach (virNodeDevicePtr dev); int virNodeDeviceDetachFlags(virNodeDevicePtr dev, const char *driverName, diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 8f498e6..dc6f2cd 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -10275,7 +10275,7 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, unsigned domain, bus, slot, function; int ret = -1; - virCheckFlags(0, -1); + virCheckFlags(VIR_NODE_DEVICE_DETACH_GROUP, -1); if (qemuNodeDeviceGetPciInfo(dev, &domain, &bus, &slot, &function) < 0) return -1; @@ -10285,9 +10285,12 @@ qemuNodeDeviceDetachFlags(virNodeDevicePtr dev, return -1; if (!driverName || STREQ(driverName, "kvm")) { + virCheckFlags(0, -1); if (virPCIDeviceSetStubDriver(pci, "pci-stub") < 0) goto out; } else if (STREQ(driverName, "vfio")) { + virPCIDeviceSetAttachGroup(pci, + !!(flags & VIR_NODE_DEVICE_DETACH_GROUP)); if (virPCIDeviceSetStubDriver(pci, "vfio-pci") < 0) goto out; } else { -- 1.7.11.7

The nodedev-detach command gains a "--group" option, which will instruct libvirt to detach all devices in the same group as the specified device. This option requires the recently added virNodeDeviceDetachFlags() API in libvirtd. --- tools/virsh-nodedev.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/tools/virsh-nodedev.c b/tools/virsh-nodedev.c index da93b8e..39f74f2 100644 --- a/tools/virsh-nodedev.c +++ b/tools/virsh-nodedev.c @@ -607,6 +607,10 @@ static const vshCmdOptDef opts_node_device_detach[] = { .type = VSH_OT_STRING, .help = N_("pci device assignment backend driver (e.g. 'vfio' or 'kvm'") }, + {.name = "group", + .type = VSH_OT_BOOL, + .help = N_("detach all devices in the same IOMMU group") + }, {.name = NULL} }; @@ -616,6 +620,7 @@ cmdNodeDeviceDetach(vshControl *ctl, const vshCmd *cmd) const char *name = NULL; const char *driverName = NULL; virNodeDevicePtr device; + bool group = vshCommandOptBool(cmd, "group"); bool ret = true; if (vshCommandOptStringReq(ctl, cmd, "device", &name) < 0) @@ -628,9 +633,13 @@ cmdNodeDeviceDetach(vshControl *ctl, const vshCmd *cmd) return false; } - if (driverName) { + if (driverName || group) { /* we must use the newer API that accepts a driverName */ - if (virNodeDeviceDetachFlags(device, driverName, 0) < 0) + unsigned int flags = 0; + + if (group) + flags |= VIR_NODE_DEVICE_DETACH_GROUP; + if (virNodeDeviceDetachFlags(device, driverName, flags) < 0) ret = false; } else { /* Yes, our (old) public API is misspelled. At least virsh -- 1.7.11.7

The virNodeDeviceDettach() API recently required an update via the addition of a "driver" and "flags" argument in the new virNodeDeviceDetachFlags() API. This was initially done so that the user could specify which stub driver was to be attached (with a new "driver" char* argument), and a flags argument was added for good measure. The flags argument has now been put to good use in order to specify whether all devices in the same group as the specified device should also be detached. It turns out that, although the driver name isn't required for virNodeDeviceReAttach() (since it can be found by looking at the device's driver binding), it *is* necessary to provide a "reattach entire group" flag. Because of this, we now also need a new virNodeDeviceReAttachFlags() API, which is otherwise identical to virNodeDeviceReAttach(). A single flag is defined for this new API: VIR_NODE_DEVICE_REATTACH_GROUP The effect of this flag is the following: If this device was bound to the vfio-pci driver, only re-attach it to its host driver if all other devices in the same group are also not assigned to any guest, and in that case re-attach *all* devices in the group that are currently attached to the vfio-pci driver to their respective host drivers. (In practice this is actually much less confusing than it sounds in this explanation). --- include/libvirt/libvirt.h.in | 8 ++++++ src/driver.h | 5 ++++ src/libvirt.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ src/libvirt_public.syms | 4 +++ 4 files changed, 77 insertions(+) diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index ba471b2..7343665 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -3305,7 +3305,15 @@ int virNodeDeviceDettach (virNodeDevicePtr dev); int virNodeDeviceDetachFlags(virNodeDevicePtr dev, const char *driverName, unsigned int flags); + +typedef enum { + VIR_NODE_DEVICE_REATTACH_GROUP = 1 << 0, /* reattach all devices in same group */ +} virNodeDeviceReAttachFlagsType; + int virNodeDeviceReAttach (virNodeDevicePtr dev); +int virNodeDeviceReAttachFlags(virNodeDevicePtr dev, + unsigned int flags); + int virNodeDeviceReset (virNodeDevicePtr dev); virNodeDevicePtr virNodeDeviceCreateXML (virConnectPtr conn, diff --git a/src/driver.h b/src/driver.h index ec5fc53..f5b7005 100644 --- a/src/driver.h +++ b/src/driver.h @@ -626,6 +626,10 @@ typedef int (*virDrvNodeDeviceReAttach)(virNodeDevicePtr dev); typedef int +(*virDrvNodeDeviceReAttachFlags)(virNodeDevicePtr dev, + unsigned int flags); + +typedef int (*virDrvNodeDeviceReset)(virNodeDevicePtr dev); typedef int @@ -1173,6 +1177,7 @@ struct _virDriver { virDrvNodeDeviceDettach nodeDeviceDettach; virDrvNodeDeviceDetachFlags nodeDeviceDetachFlags; virDrvNodeDeviceReAttach nodeDeviceReAttach; + virDrvNodeDeviceReAttachFlags nodeDeviceReAttachFlags; virDrvNodeDeviceReset nodeDeviceReset; virDrvDomainMigratePrepareTunnel domainMigratePrepareTunnel; virDrvConnectIsEncrypted connectIsEncrypted; diff --git a/src/libvirt.c b/src/libvirt.c index db120b7..4144998 100644 --- a/src/libvirt.c +++ b/src/libvirt.c @@ -14839,6 +14839,10 @@ error: * * If the device is currently in use by a guest, this method may fail. * + * If the caller needs control over any options during the reattach + * then the newer virNodeDeviceReAttachFlags() API should be used + * instead. + * * Returns 0 in case of success, -1 in case of failure. */ int @@ -14875,6 +14879,62 @@ error: } /** + * virNodeDeviceReAttachFlags: + * @dev: pointer to the node device + * @flags: extra flags; Currently there is one possible flag: + * + * VIR_NODE_DEVICE_REATTACH_GROUP: If this device was bound to the + * vfio-pci driver, only re-attach it to its host driver if all + * other devices in the same group are also not assigned to any + * guest, and in that case re-attach *all* devices in the group + * that are currently attached to the vfio-pci driver to their + * respective host drivers. + * + * Re-attach a previously dettached node device to the node so that it + * may be used by the node again. + * + * Depending on the hypervisor, this may involve operations such + * as resetting the device, unbinding it from a dummy device driver + * and binding it to its appropriate driver. + * + * If the device is currently in use by a guest, this method may fail. + * + * Returns 0 in case of success, -1 in case of failure. + */ +int +virNodeDeviceReAttachFlags(virNodeDevicePtr dev, unsigned int flags) +{ + VIR_DEBUG("dev=%p, conn=%p flags=%x", dev, dev ? dev->conn : NULL, flags); + + virResetLastError(); + + if (!VIR_IS_CONNECTED_NODE_DEVICE(dev)) { + virLibNodeDeviceError(VIR_ERR_INVALID_NODE_DEVICE, __FUNCTION__); + virDispatchError(NULL); + return -1; + } + + if (dev->conn->flags & VIR_CONNECT_RO) { + virLibConnError(VIR_ERR_OPERATION_DENIED, __FUNCTION__); + goto error; + } + + if (dev->conn->driver->nodeDeviceReAttachFlags) { + int ret; + ret = dev->conn->driver->nodeDeviceReAttachFlags(dev, flags); + if (ret < 0) + goto error; + return ret; + } + + virLibConnError(VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + virDispatchError(dev->conn); + return -1; +} + +/** * virNodeDeviceReset: * @dev: pointer to the node device * diff --git a/src/libvirt_public.syms b/src/libvirt_public.syms index 4ee2d27..82968a0 100644 --- a/src/libvirt_public.syms +++ b/src/libvirt_public.syms @@ -621,4 +621,8 @@ LIBVIRT_1.0.6 { virGetLastErrorMessage; } LIBVIRT_1.0.5; +LIBVIRT_1.0.7 { + global: + virNodeDeviceReAttachFlags; +} LIBVIRT_1.0.6; # .... define new API here using predicted next version number .... -- 1.7.11.7

This requires a custom function for remoteNodeDeviceReAttachFlags, because it is named *NodeDevice, but it goes through the hypervisor driver rather than nodedevice driver, and so it uses privateData instead of nodeDevicePrivateData. (It has to go through the hypervisor driver, because that is the driver that knows about the backend drivers that will perform the pci device assignment). --- src/remote/remote_driver.c | 29 +++++++++++++++++++++++++++++ src/remote/remote_protocol.x | 12 +++++++++++- src/remote_protocol-structs | 5 +++++ 3 files changed, 45 insertions(+), 1 deletion(-) diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index 87c61f4..2c006d9 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -3497,6 +3497,34 @@ done: } static int +remoteNodeDeviceReAttachFlags(virNodeDevicePtr dev, + unsigned int flags) +{ + int rv = -1; + remote_node_device_re_attach_flags_args args; + /* This method is unusual in that it uses the HV driver, not the + * devMon driver hence its use of privateData, instead of + * nodeDevicePrivateData */ + struct private_data *priv = dev->conn->privateData; + + remoteDriverLock(priv); + + args.name = dev->name; + args.flags = flags; + + if (call(dev->conn, priv, 0, REMOTE_PROC_NODE_DEVICE_RE_ATTACH_FLAGS, + (xdrproc_t) xdr_remote_node_device_re_attach_flags_args, + (char *) &args, (xdrproc_t) xdr_void, (char *) NULL) == -1) + goto done; + + rv = 0; + +done: + remoteDriverUnlock(priv); + return rv; +} + +static int remoteNodeDeviceReset(virNodeDevicePtr dev) { int rv = -1; @@ -6291,6 +6319,7 @@ static virDriver remote_driver = { .nodeDeviceDettach = remoteNodeDeviceDettach, /* 0.6.1 */ .nodeDeviceDetachFlags = remoteNodeDeviceDetachFlags, /* 1.0.5 */ .nodeDeviceReAttach = remoteNodeDeviceReAttach, /* 0.6.1 */ + .nodeDeviceReAttachFlags = remoteNodeDeviceReAttachFlags, /* 1.0.7 */ .nodeDeviceReset = remoteNodeDeviceReset, /* 0.6.1 */ .domainMigratePrepareTunnel = remoteDomainMigratePrepareTunnel, /* 0.7.2 */ .connectIsEncrypted = remoteConnectIsEncrypted, /* 0.7.3 */ diff --git a/src/remote/remote_protocol.x b/src/remote/remote_protocol.x index 9723377..e32898e 100644 --- a/src/remote/remote_protocol.x +++ b/src/remote/remote_protocol.x @@ -1910,6 +1910,11 @@ struct remote_node_device_re_attach_args { remote_nonnull_string name; }; +struct remote_node_device_re_attach_flags_args { + remote_nonnull_string name; + unsigned int flags; +}; + struct remote_node_device_reset_args { remote_nonnull_string name; }; @@ -4434,6 +4439,11 @@ enum remote_procedure { /** * @generate: server */ - REMOTE_PROC_NODE_DEVICE_DETACH_FLAGS = 301 + REMOTE_PROC_NODE_DEVICE_DETACH_FLAGS = 301, + + /** + * @generate: server + */ + REMOTE_PROC_NODE_DEVICE_RE_ATTACH_FLAGS = 302 }; diff --git a/src/remote_protocol-structs b/src/remote_protocol-structs index ea38ea2..6bcf425 100644 --- a/src/remote_protocol-structs +++ b/src/remote_protocol-structs @@ -1448,6 +1448,10 @@ struct remote_node_device_detach_flags_args { struct remote_node_device_re_attach_args { remote_nonnull_string name; }; +struct remote_node_device_re_attach_flags_args { + remote_nonnull_string name; + u_int flags; +}; struct remote_node_device_reset_args { remote_nonnull_string name; }; @@ -2494,4 +2498,5 @@ enum remote_procedure { REMOTE_PROC_DOMAIN_MIGRATE_GET_COMPRESSION_CACHE = 299, REMOTE_PROC_DOMAIN_MIGRATE_SET_COMPRESSION_CACHE = 300, REMOTE_PROC_NODE_DEVICE_DETACH_FLAGS = 301, + REMOTE_PROC_NODE_DEVICE_RE_ATTACH_FLAGS = 302, }; -- 1.7.11.7

The only difference from virNodeDeviceReAttach() is that the attach_group bool is set in the virPCIDevice object if appropriate. --- src/qemu/qemu_driver.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index dc6f2cd..16b8805 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -10321,8 +10321,10 @@ qemuNodeDeviceDettach(virNodeDevicePtr dev) return qemuNodeDeviceDetachFlags(dev, NULL, 0); } + static int -qemuNodeDeviceReAttach(virNodeDevicePtr dev) +qemuNodeDeviceReAttachFlags(virNodeDevicePtr dev, + unsigned int flags) { virQEMUDriverPtr driver = dev->conn->privateData; virPCIDevicePtr pci; @@ -10330,6 +10332,8 @@ qemuNodeDeviceReAttach(virNodeDevicePtr dev) unsigned domain, bus, slot, function; int ret = -1; + virCheckFlags(VIR_NODE_DEVICE_REATTACH_GROUP, -1); + if (qemuNodeDeviceGetPciInfo(dev, &domain, &bus, &slot, &function) < 0) return -1; @@ -10337,6 +10341,8 @@ qemuNodeDeviceReAttach(virNodeDevicePtr dev) if (!pci) return -1; + virPCIDeviceSetAttachGroup(pci, !!(flags & VIR_NODE_DEVICE_REATTACH_GROUP)); + virObjectLock(driver->activePciHostdevs); virObjectLock(driver->inactivePciHostdevs); other = virPCIDeviceListFind(driver->activePciHostdevs, pci); @@ -10366,6 +10372,14 @@ out: return ret; } + +static int +qemuNodeDeviceReAttach(virNodeDevicePtr dev) +{ + return qemuNodeDeviceReAttachFlags(dev, 0); +} + + static int qemuNodeDeviceReset(virNodeDevicePtr dev) { @@ -15312,6 +15326,7 @@ static virDriver qemuDriver = { .nodeDeviceDettach = qemuNodeDeviceDettach, /* 0.6.1 */ .nodeDeviceDetachFlags = qemuNodeDeviceDetachFlags, /* 1.0.5 */ .nodeDeviceReAttach = qemuNodeDeviceReAttach, /* 0.6.1 */ + .nodeDeviceReAttachFlags = qemuNodeDeviceReAttachFlags, /* 1.0.7 */ .nodeDeviceReset = qemuNodeDeviceReset, /* 0.6.1 */ .domainMigratePrepareTunnel = qemuDomainMigratePrepareTunnel, /* 0.7.2 */ .connectIsEncrypted = qemuConnectIsEncrypted, /* 0.7.3 */ -- 1.7.11.7

Since the xen driver doesn't support the vfio driver (which is the only one that would benefit from the VIR_NODE_DEVICE_REATTACH_GROUP flag), the Xen version of this function (xenUnifiedNodeDeviceReAttachFlags()) just validates that flags == 0, but is otherwise identical to the old xenUnifiedNodeDeviceReAttach() --- src/xen/xen_driver.c | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/src/xen/xen_driver.c b/src/xen/xen_driver.c index c9e4db8..e745318 100644 --- a/src/xen/xen_driver.c +++ b/src/xen/xen_driver.c @@ -2318,14 +2318,18 @@ out: return ret; } + static int -xenUnifiedNodeDeviceReAttach(virNodeDevicePtr dev) +xenUnifiedNodeDeviceReAttachFlags(virNodeDevicePtr dev, + unsigned int flags) { virPCIDevicePtr pci; unsigned domain, bus, slot, function; int ret = -1; int domid; + virCheckFlags(0, -1); + if (xenUnifiedNodeDeviceGetPciInfo(dev, &domain, &bus, &slot, &function) < 0) return -1; @@ -2350,6 +2354,14 @@ out: return ret; } + +static int +xenUnifiedNodeDeviceReAttach(virNodeDevicePtr dev) +{ + return xenUnifiedNodeDeviceReAttachFlags(dev, 0); +} + + static int xenUnifiedNodeDeviceReset(virNodeDevicePtr dev) { @@ -2544,6 +2556,7 @@ static virDriver xenUnifiedDriver = { .nodeDeviceDettach = xenUnifiedNodeDeviceDettach, /* 0.6.1 */ .nodeDeviceDetachFlags = xenUnifiedNodeDeviceDetachFlags, /* 1.0.5 */ .nodeDeviceReAttach = xenUnifiedNodeDeviceReAttach, /* 0.6.1 */ + .nodeDeviceReAttachFlags = xenUnifiedNodeDeviceReAttachFlags, /* 1.0.7 */ .nodeDeviceReset = xenUnifiedNodeDeviceReset, /* 0.6.1 */ .connectIsEncrypted = xenUnifiedConnectIsEncrypted, /* 0.7.3 */ .connectIsSecure = xenUnifiedConnectIsSecure, /* 0.7.3 */ -- 1.7.11.7

If --group is specified on the virsh nodedev-reattach commandline, the flag VIR_NODE_DEVICE_REATTACH_GROUP is used and the new virNodeDeviceReAttachFlags() API is called (instead of virNodeDeviceReAttach()). This causes the following change in behavior (only if the given device is currently attached to the vfio-pci driver): 1) If any device in the same iommu group as the given device is still in use by a domain, no reattachment of any device is performed (i.e. this device remains attached to vfio-pci), and an error is reported. 2) If none of the devices in the same iommu group as the given device are in use by a domain, then *all* devices that are currently attached to vfio-pci are re-attached to their respective host drivers (any devices that are currently attached to pci-stub or pcieport will remain attached to those drivers). --- tools/virsh-nodedev.c | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/tools/virsh-nodedev.c b/tools/virsh-nodedev.c index 39f74f2..8282129 100644 --- a/tools/virsh-nodedev.c +++ b/tools/virsh-nodedev.c @@ -677,6 +677,10 @@ static const vshCmdOptDef opts_node_device_reattach[] = { .flags = VSH_OFLAG_REQ, .help = N_("device key") }, + {.name = "group", + .type = VSH_OT_BOOL, + .help = N_("re-attach all devices in the same IOMMU group, but only if none are in use") + }, {.name = NULL} }; @@ -685,6 +689,7 @@ cmdNodeDeviceReAttach(vshControl *ctl, const vshCmd *cmd) { const char *name = NULL; virNodeDevicePtr device; + bool group = vshCommandOptBool(cmd, "group"); bool ret = true; if (vshCommandOptStringReq(ctl, cmd, "device", &name) < 0) @@ -695,7 +700,17 @@ cmdNodeDeviceReAttach(vshControl *ctl, const vshCmd *cmd) return false; } - if (virNodeDeviceReAttach(device) == 0) { + if (group) { + unsigned int flags = VIR_NODE_DEVICE_REATTACH_GROUP; + + if (virNodeDeviceReAttachFlags(device, flags) < 0) + ret = false; + } else { + if (virNodeDeviceReAttach(device) < 0) + ret = false; + } + + if (ret) { vshPrint(ctl, _("Device %s re-attached\n"), name); } else { vshError(ctl, _("Failed to re-attach device %s"), name); -- 1.7.11.7

This includes adding it to the nodedev parser and formatter, docs, and test. --- docs/formatnode.html.in | 63 +++++++++++++++- docs/schemas/nodedev.rng | 11 +++ src/conf/node_device_conf.c | 86 +++++++++++++++++++++- src/conf/node_device_conf.h | 5 +- src/node_device/node_device_udev.c | 21 +++++- tests/nodedevschemadata/pci_8086_10c9_sriov_pf.xml | 16 ++++ tests/nodedevxml2xmltest.c | 1 + 7 files changed, 199 insertions(+), 4 deletions(-) create mode 100644 tests/nodedevschemadata/pci_8086_10c9_sriov_pf.xml diff --git a/docs/formatnode.html.in b/docs/formatnode.html.in index 11654e5..ad05a70 100644 --- a/docs/formatnode.html.in +++ b/docs/formatnode.html.in @@ -80,6 +80,36 @@ <dd>Vendor details from the device ROM, including an attribute <code>id</code> with the hexadecimal vendor id, and an optional text name of that vendor.</dd> + <dt><code>iommuGroup</code></dt> + <dd> + This optional element describes the "IOMMU group" this + device belongs to. If the element exists, it has a + mandatory <code>number</code> attribute which tells + the group number used for management of the group (all + devices in group "n" will be found in + "/sys/kernel/iommu_groups/n"). It will also have a + list of <code>address</code> subelements, each + containing the PCI address of a device in the same + group. The toplevel device will itself be included in + this list. + </dd> + <dt><code>capability</code></dt> + <dd> + This optional element can occur multiple times. If it + exists, it has a mandatory <code>type</code> attribute + which will be set to + either <code>physical_function</code> + or <code>virtual_functions</code>. If the type + is <code>physical_function</code>, there will be a + single <code>address</code> subelement which contains + the PCI address of the SRIOV Physical Function (PF) + that is the parent of this device (and this device is, + by implication, an SRIOV Virtual Function (VF)). If + the type is <code>virtual_functions</code>, then this + device is an SRIOV PF, and the capability element will + have a list of <code>address</code> subelements, one + for each VF on this PF. + </dd> </dl> </dd> <dt><code>usb_device</code></dt> @@ -232,7 +262,38 @@ <address>00:27:13:6a:fe:00</address> <capability type='80203'/> </capability> -</device></pre> +</device> + +<device> + <name>pci_0000_02_00_0</name> + <path>/sys/devices/pci0000:00/0000:00:04.0/0000:02:00.0</path> + <parent>pci_0000_00_04_0</parent> + <driver> + <name>igb</name> + </driver> + <capability type='pci'> + <domain>0</domain> + <bus>2</bus> + <slot>0</slot> + <function>0</function> + <product id='0x10c9'>82576 Gigabit Network Connection</product> + <vendor id='0x8086'>Intel Corporation</vendor> + <capability type='virt_functions'> + <address domain='0x0000' bus='0x02' slot='0x10' function='0x0'/> + <address domain='0x0000' bus='0x02' slot='0x10' function='0x2'/> + <address domain='0x0000' bus='0x02' slot='0x10' function='0x4'/> + <address domain='0x0000' bus='0x02' slot='0x10' function='0x6'/> + <address domain='0x0000' bus='0x02' slot='0x11' function='0x0'/> + <address domain='0x0000' bus='0x02' slot='0x11' function='0x2'/> + <address domain='0x0000' bus='0x02' slot='0x11' function='0x4'/> + </capability> + <iommuGroup number='12'> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x1'/> + </iommuGroup> + </capability> +</device> + </pre> </body> </html> diff --git a/docs/schemas/nodedev.rng b/docs/schemas/nodedev.rng index 1f5dcb9..d2bebff 100644 --- a/docs/schemas/nodedev.rng +++ b/docs/schemas/nodedev.rng @@ -144,6 +144,17 @@ </element> </optional> + <optional> + <element name='iommuGroup'> + <attribute name='number'> + <ref name='unsignedInt'/> + </attribute> + <oneOrMore> + <ref name='address'/> + </oneOrMore> + </element> + </optional> + </define> <define name='capusbdev'> diff --git a/src/conf/node_device_conf.c b/src/conf/node_device_conf.c index 548cd8f..96742ef 100644 --- a/src/conf/node_device_conf.c +++ b/src/conf/node_device_conf.c @@ -1,6 +1,7 @@ /* * node_device_conf.c: config handling for node devices * + * Copyright (C) 2009-2013 Red Hat, Inc. * Copyright (C) 2008 Virtual Iron Software, Inc. * Copyright (C) 2008 David F. Lively * @@ -31,6 +32,7 @@ #include "viralloc.h" #include "virstring.h" #include "node_device_conf.h" +#include "device_conf.h" #include "virxml.h" #include "virbuffer.h" #include "viruuid.h" @@ -330,6 +332,20 @@ char *virNodeDeviceDefFormat(const virNodeDeviceDefPtr def) } virBufferAddLit(&buf, " </capability>\n"); } + if (data->pci_dev.nIommuGroupDevices) { + virBufferAsprintf(&buf, " <iommuGroup number='%d'>\n", + data->pci_dev.iommuGroupNumber); + for (i = 0; i < data->pci_dev.nIommuGroupDevices; i++) { + virBufferAsprintf(&buf, + " <address domain='0x%.4x' bus='0x%.2x' " + "slot='0x%.2x' function='0x%.1x'/>\n", + data->pci_dev.iommuGroupDevices[i]->domain, + data->pci_dev.iommuGroupDevices[i]->bus, + data->pci_dev.iommuGroupDevices[i]->slot, + data->pci_dev.iommuGroupDevices[i]->function); + } + virBufferAddLit(&buf, " </iommuGroup>\n"); + } break; case VIR_NODE_DEV_CAP_USB_DEV: virBufferAsprintf(&buf, " <bus>%d</bus>\n", data->usb_dev.bus); @@ -966,12 +982,71 @@ out: } static int +virNodeDevCapPciDevIommuGroupParseXML(xmlXPathContextPtr ctxt, + xmlNodePtr iommuGroupNode, + union _virNodeDevCapData *data) +{ + xmlNodePtr origNode = ctxt->node; + xmlNodePtr *addrNodes = NULL; + char *numberStr = NULL; + int nAddrNodes, ii, ret = -1; + virPCIDeviceAddressPtr pciAddr = NULL; + + ctxt->node = iommuGroupNode; + + numberStr = virXMLPropString(iommuGroupNode, "number"); + if (!numberStr) { + virReportError(VIR_ERR_XML_ERROR, + "%s", _("missing iommuGroup number attribute")); + goto cleanup; + } + if (virStrToLong_ui(numberStr, NULL, 10, + &data->pci_dev.iommuGroupNumber) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("invalid iommuGroup number attribute '%s'"), + numberStr); + goto cleanup; + } + + if ((nAddrNodes = virXPathNodeSet("./address", ctxt, &addrNodes)) < 0) + goto cleanup; + + for (ii = 0; ii < nAddrNodes; ii++) { + virDevicePCIAddress addr = { 0, 0, 0, 0, 0 }; + if (virDevicePCIAddressParseXML(addrNodes[ii], &addr) < 0) + goto cleanup; + if (VIR_ALLOC(pciAddr) < 0) { + virReportOOMError(); + goto cleanup; + } + pciAddr->domain = addr.domain; + pciAddr->bus = addr.bus; + pciAddr->slot = addr.slot; + pciAddr->function = addr.function; + if (VIR_APPEND_ELEMENT(data->pci_dev.iommuGroupDevices, + data->pci_dev.nIommuGroupDevices, + pciAddr) < 0) { + virReportOOMError(); + goto cleanup; + } + } + + ret = 0; +cleanup: + ctxt->node = origNode; + VIR_FREE(addrNodes); + VIR_FREE(pciAddr); + return ret; +} + + +static int virNodeDevCapPciDevParseXML(xmlXPathContextPtr ctxt, virNodeDeviceDefPtr def, xmlNodePtr node, union _virNodeDevCapData *data) { - xmlNodePtr orignode; + xmlNodePtr orignode, iommuGroupNode; int ret = -1; orignode = ctxt->node; @@ -1016,6 +1091,12 @@ virNodeDevCapPciDevParseXML(xmlXPathContextPtr ctxt, data->pci_dev.vendor_name = virXPathString("string(./vendor[1])", ctxt); data->pci_dev.product_name = virXPathString("string(./product[1])", ctxt); + if ((iommuGroupNode = virXPathNode("./iommuGroup[1]", ctxt))) { + if (virNodeDevCapPciDevIommuGroupParseXML(ctxt, iommuGroupNode, + data) < 0) { + goto out; + } + } ret = 0; out: ctxt->node = orignode; @@ -1385,6 +1466,9 @@ void virNodeDevCapsDefFree(virNodeDevCapsDefPtr caps) for (i = 0; i < data->pci_dev.num_virtual_functions; i++) { VIR_FREE(data->pci_dev.virtual_functions[i]); } + for (i = 0; i < data->pci_dev.nIommuGroupDevices; i++) { + VIR_FREE(data->pci_dev.iommuGroupDevices[i]); + } break; case VIR_NODE_DEV_CAP_USB_DEV: VIR_FREE(data->usb_dev.product_name); diff --git a/src/conf/node_device_conf.h b/src/conf/node_device_conf.h index 17240be..ec35da2 100644 --- a/src/conf/node_device_conf.h +++ b/src/conf/node_device_conf.h @@ -1,7 +1,7 @@ /* * node_device_conf.h: config handling for node devices * - * Copyright (C) 2010-2011 Red Hat, Inc. + * Copyright (C) 2009-2013 Red Hat, Inc. * Copyright (C) 2008 Virtual Iron Software, Inc. * Copyright (C) 2008 David F. Lively * @@ -112,6 +112,9 @@ struct _virNodeDevCapsDef { virPCIDeviceAddressPtr *virtual_functions; unsigned int num_virtual_functions; unsigned int flags; + virPCIDeviceAddressPtr *iommuGroupDevices; + size_t nIommuGroupDevices; + unsigned int iommuGroupNumber; } pci_dev; struct { unsigned int bus; diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index b94be80..9bb4a17 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -421,7 +421,8 @@ static int udevProcessPCI(struct udev_device *device, { const char *syspath = NULL; union _virNodeDevCapData *data = &def->caps->data; - int ret = -1; + virPCIDeviceAddress addr; + int tmpGroup, ret = -1; char *p; int rc; @@ -501,6 +502,24 @@ static int udevProcessPCI(struct udev_device *device, else if (!rc && (data->pci_dev.num_virtual_functions > 0)) data->pci_dev.flags |= VIR_NODE_DEV_CAP_FLAG_PCI_VIRTUAL_FUNCTION; + /* iommu group */ + addr.domain = data->pci_dev.domain; + addr.bus = data->pci_dev.bus; + addr.slot = data->pci_dev.slot; + addr.function = data->pci_dev.function; + tmpGroup = virPCIGetIOMMUGroupNum(&addr); + if (tmpGroup == -1) { + /* error was already reported */ + goto out; + /* -2 return means there is no iommu_group data */ + } else if (tmpGroup >= 0) { + if (virPCIGetIOMMUGroupAddresses(&addr, &data->pci_dev.iommuGroupDevices, + &data->pci_dev.nIommuGroupDevices) < 0) { + goto out; + } + data->pci_dev.iommuGroupNumber = tmpGroup; + } + ret = 0; out: diff --git a/tests/nodedevschemadata/pci_8086_10c9_sriov_pf.xml b/tests/nodedevschemadata/pci_8086_10c9_sriov_pf.xml new file mode 100644 index 0000000..eff8932 --- /dev/null +++ b/tests/nodedevschemadata/pci_8086_10c9_sriov_pf.xml @@ -0,0 +1,16 @@ +<device> + <name>pci_0000_02_00_0</name> + <parent>pci_0000_00_04_0</parent> + <capability type='pci'> + <domain>0</domain> + <bus>2</bus> + <slot>0</slot> + <function>0</function> + <product id='0x10c9'>82576 Gigabit Network Connection</product> + <vendor id='0x8086'>Intel Corporation</vendor> + <iommuGroup number='12'> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + <address domain='0x0000' bus='0x02' slot='0x00' function='0x1'/> + </iommuGroup> + </capability> +</device> diff --git a/tests/nodedevxml2xmltest.c b/tests/nodedevxml2xmltest.c index e81c617..ed49857 100644 --- a/tests/nodedevxml2xmltest.c +++ b/tests/nodedevxml2xmltest.c @@ -78,6 +78,7 @@ mymain(void) DO_TEST("net_00_13_02_b9_f9_d3"); DO_TEST("net_00_15_58_2f_e9_55"); DO_TEST("pci_1002_71c4"); + DO_TEST("pci_8086_10c9_sriov_pf"); DO_TEST("pci_8086_27c5_scsi_host_0"); DO_TEST("pci_8086_27c5_scsi_host_scsi_device_lun0"); DO_TEST("pci_8086_27c5_scsi_host_scsi_host"); -- 1.7.11.7

The "group" attribute of the <driver> element provides a method to easily detach all devices in an IOMMU group from the host and bind them to the vfio-pci driver so that one or more of those devices can be assigned to a guest. Because this operation makes all the devices in the group unusable by the host, it is not the default mode of operation, but must be consciously selected in the config by the user. This patch is only the parser/formatter, RNG, docs and XML tests. It does not actually tie this new option in to the driver - that will be in the next patch. --- docs/formatdomain.html.in | 94 +++++++++++++++++++--- docs/formatnetwork.html.in | 11 +++ docs/schemas/domaincommon.rng | 16 ++++ docs/schemas/network.rng | 8 ++ src/conf/domain_conf.c | 36 ++++++++- src/conf/domain_conf.h | 13 +++ src/conf/network_conf.c | 39 ++++++++- src/conf/network_conf.h | 14 ++++ tests/networkxml2xmlin/hostdev-pf.xml | 2 +- tests/networkxml2xmlout/hostdev-pf.xml | 2 +- .../qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml | 2 +- .../qemuxml2argv-net-hostdev-vfio.xml | 2 +- 12 files changed, 220 insertions(+), 19 deletions(-) diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 77126a5..f8d662a 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -2515,18 +2515,60 @@ more details on the address element.</dd> <dt><code>driver</code></dt> <dd> - PCI devices can have an optional <code>driver</code> - subelement that specifies which backend driver to use for PCI - device assignment. Use the <code>name</code> attribute to - select either "vfio" (for the new VFIO device assignment - backend, which is compatible with UEFI SecureBoot) or "kvm" - (for the legacy device assignment handled directly by the KVM - kernel module)<span class="since">Since 1.0.5 (QEMU and KVM - only, requires kernel 3.6 or newer)</span>. Currently, "kvm" - is the default used by libvirt when not explicitly provided, - but since the two are functionally equivalent, this default - could be changed in the future with no impact to domains that - don't specify anything. + <p> + PCI devices can have an optional <code>driver</code> + subelement that specifies which backend driver to use for PCI + device assignment. Use the <code>name</code> attribute to + select either "vfio" (for the new VFIO device assignment + backend, which is compatible with UEFI SecureBoot) or "kvm" + (for the legacy device assignment handled directly by the KVM + kernel module)<span class="since">Since 1.0.5 (QEMU and KVM + only, requires kernel 3.6 or newer)</span>. Currently, "kvm" + is the default used by libvirt when not explicitly provided, + but since the two are functionally equivalent, this default + could be changed in the future with no impact to domains that + don't specify anything. + </p> + <p> + When <code>name='vfio'</code> is specified, the + attribute <code>group</code> can also be optionally given in + the <code><driver></code> + element<span class="since">Since 1.0.7 (QEMU and KVM only, + requires kernel 3.6 or newer)</span>. <code>group</code> + instructs libvirt how to deal with multiple devices in a + single "IOMMU group" when it is doing managed device + assignment. Normally libvirt will only bind the single + device being assigned to the vfio-pci driver. VFIO uses the + concept of device "groups" to describe sets of devices that + cannot be safely used by multiple guests and/or the host + without compromising the security of one or the other (due + to shared DMA resources, etc); VFIO will not allow such + mixing to happen, and one way that it enforces this is to + require that, before any device in a group can be assigned + to a guest, <b>all</b> devices in the group must be detached + from their respective host drivers and be attached instead + to the vfio-pci driver (or some other driver that VFIO deems + to adequately sequester the device - currently the pci-stub + and pcieport drivers are acceptable alternatives). To make + this happen automatically (keeping in mind that the host + will lose all use of the devices in the group), simply + add <code>group='auto'</code> to + the <code><driver></code> element; libvirt will then + automatically detach from the host all devices in the same + group as the device being assigned, then bind them all to + the vfio-pci driver; any or all of those devices can then be + assigned to the same guest, but they cannot be used by the + host nor can they be assigned to any other guest.<br/><br/> + + <b>Extreme care</b> should be taken when + using <code>group='auto'</code>, to assure you are not + detaching the driver for a device that is currently in use + by the host (for example, the driver to the system disk or + something similar). The output of "virsh nodedev-dumpxml" + for the device you will be assigning will provide a list of + all devices in the same IOMMU group. + </p> + </dd> <dt><code>readonly</code></dt> <dd>Indicates that the device is readonly, only supported by SCSI host @@ -3156,6 +3198,15 @@ </p> <p> + When <code>name='vfio'</code> is specified, the + attribute <code>group</code> can also be optionally given in + the <code><driver></code> element. See the description of + this attribute in the <a href="#elementsHostDev">Host Device + Assignment</a> section above. <span class="since">Since 1.0.7 + (QEMU and KVM only, requires kernel 3.6 or newer)</span>. + </p> + + <p> Note that this "intelligent passthrough" of network devices is very similar to the functionality of a standard <hostdev> device, the difference being that this method allows specifying @@ -3173,7 +3224,7 @@ ... <devices> <interface type='hostdev'> - <driver name='vfio'/> + <driver name='vfio' group='auto'/> <source> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </source> @@ -3318,6 +3369,23 @@ qemu-kvm -net nic,model=? /dev/null kernel 3.6 or newer)</span> </dd> + <dt><code>group</code></dt> + <dd> + <b>Only for interfaces of type='hostdev'</b>, and only when + the interface element has <code>managed='yes'</code> set, if + the driver name has been set to "vfio", the <code>group</code> + attribute can be set to "auto" in order to automatically + detach from the host all devices in the same iommu group as + this device, and bind them to the vfio-pci driver. See the + description of this attribute in + the <a href="#elementsHostDev">Host Device Assignment</a> + section above. <span class="since">Since 1.0.7 (QEMU and KVM + only, requires kernel 3.6 or newer)</span>.<br/><br/> + + <b>In general you should leave this option alone, unless you + are very certain you know what you are doing.</b> + </dd> + <dt><code>txmode</code></dt> <dd> The <code>txmode</code> attribute specifies how to handle diff --git a/docs/formatnetwork.html.in b/docs/formatnetwork.html.in index a1198ce..a7eff7e 100644 --- a/docs/formatnetwork.html.in +++ b/docs/formatnetwork.html.in @@ -295,6 +295,17 @@ <span class="since">Since 1.0.5 (QEMU and KVM only, requires kernel 3.6 or newer)</span> </p> + <p> + When <code>name='vfio'</code> is specified, the + attribute <code>group</code> can also be optionally + given in the <code><driver></code> element. See + the description of this attribute in the description of + domain <a href="formatdomain.html#elementsHostDev">Host + Device Assignment</a>. <span class="since">Since 1.0.7 + (QEMU and KVM only, requires kernel 3.6 or + newer)</span>. + </p> + <p>Note that this "intelligent passthrough" of network devices is very similar to the functionality of a standard <code>< hostdev></code> device, the diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index cf82878..0d0e37e 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -2000,6 +2000,14 @@ <value>vfio</value> </choice> </attribute> + <optional> + <attribute name="group"> + <choice> + <value>manual</value> + <value>auto</value> + </choice> + </attribute> + </optional> </group> <group> <optional> @@ -3213,6 +3221,14 @@ <value>vfio</value> </choice> </attribute> + <optional> + <attribute name="group"> + <choice> + <value>manual</value> + <value>auto</value> + </choice> + </attribute> + </optional> <empty/> </element> </optional> diff --git a/docs/schemas/network.rng b/docs/schemas/network.rng index ded8580..9b786ae 100644 --- a/docs/schemas/network.rng +++ b/docs/schemas/network.rng @@ -157,6 +157,14 @@ <value>vfio</value> </choice> </attribute> + <optional> + <attribute name="group"> + <choice> + <value>manual</value> + <value>auto</value> + </choice> + </attribute> + </optional> <empty/> </element> </optional> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index e41dfa2..997d68d 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -606,6 +606,12 @@ VIR_ENUM_IMPL(virDomainHostdevSubsysPciBackend, "kvm", "vfio") +VIR_ENUM_IMPL(virDomainHostdevSubsysPciGroupMgmt, + VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_LAST, + "default", + "manual", + "auto") + VIR_ENUM_IMPL(virDomainHostdevCaps, VIR_DOMAIN_HOSTDEV_CAPS_TYPE_LAST, "storage", "misc", @@ -3932,6 +3938,8 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, char *sgio = NULL; char *backendStr = NULL; int backend; + char *groupMgmtStr = NULL; + int groupMgmt; int ret = -1; /* @managed can be read from the xml document - it is always an @@ -4014,6 +4022,18 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, } def->source.subsys.u.pci.backend = backend; + groupMgmt = VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_DEFAULT; + if ((groupMgmtStr = virXPathString("string(./driver/@group)", ctxt)) && + (((groupMgmt = virDomainHostdevSubsysPciGroupMgmtTypeFromString(groupMgmtStr)) < 0) || + groupMgmt == VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_DEFAULT)) { + virReportError(VIR_ERR_XML_ERROR, + _("Unknown PCI device group management method " + "<driver group='%s'/> has been specified"), + groupMgmtStr); + goto error; + } + def->source.subsys.u.pci.groupMgmt = groupMgmt; + break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB: @@ -4038,6 +4058,7 @@ error: VIR_FREE(managed); VIR_FREE(sgio); VIR_FREE(backendStr); + VIR_FREE(groupMgmtStr); return ret; } @@ -14361,6 +14382,8 @@ virDomainHostdevDefFormatSubsys(virBufferPtr buf, if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && def->source.subsys.u.pci.backend != VIR_DOMAIN_HOSTDEV_PCI_BACKEND_DEFAULT) { const char *backend = virDomainHostdevSubsysPciBackendTypeToString(def->source.subsys.u.pci.backend); + const char *groupMgmt + = virDomainHostdevSubsysPciGroupMgmtTypeToString(def->source.subsys.u.pci.groupMgmt); if (!backend) { virReportError(VIR_ERR_INTERNAL_ERROR, @@ -14368,7 +14391,18 @@ virDomainHostdevDefFormatSubsys(virBufferPtr buf, def->source.subsys.u.pci.backend); return -1; } - virBufferAsprintf(buf, "<driver name='%s'/>\n", backend); + virBufferAsprintf(buf, "<driver name='%s'", backend); + if (def->source.subsys.u.pci.groupMgmt != VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_DEFAULT) { + if (!groupMgmt) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("unexpected pci hostdev driver group " + "management method %d"), + def->source.subsys.u.pci.groupMgmt); + return -1; + } + virBufferAsprintf(buf, " group='%s'", groupMgmt); + } + virBufferAddLit(buf, "/>\n"); } virBufferAddLit(buf, "<source"); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 3817e37..04bddcc 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -400,6 +400,18 @@ typedef enum { VIR_ENUM_DECL(virDomainHostdevSubsysPciBackend) +/* how to manage multiple devices in the same iommu group */ +typedef enum { + VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_DEFAULT, /* default is "manual" */ + /* don't automatically detach all devices in group */ + VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_MANUAL, + /* *do* automatically detach/attach all devices in group */ + VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_AUTO, + VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_LAST +} virDomainHostdevSubsysPciGroupMgmtType; + +VIR_ENUM_DECL(virDomainHostdevSubsysPciGroupMgmt) + typedef struct _virDomainHostdevSubsys virDomainHostdevSubsys; typedef virDomainHostdevSubsys *virDomainHostdevSubsysPtr; struct _virDomainHostdevSubsys { @@ -417,6 +429,7 @@ struct _virDomainHostdevSubsys { struct { virDevicePCIAddress addr; /* host address */ int backend; /* enum virDomainHostdevSubsysPciBackendType */ + int groupMgmt; /* enum virDomainHostdevSubsysPciGroupMgmtType */ } pci; struct { char *adapter; diff --git a/src/conf/network_conf.c b/src/conf/network_conf.c index 2b4845c..16d2386 100644 --- a/src/conf/network_conf.c +++ b/src/conf/network_conf.c @@ -66,6 +66,12 @@ VIR_ENUM_IMPL(virNetworkForwardDriverName, "kvm", "vfio") +VIR_ENUM_IMPL(virNetworkForwardDriverGroupMgmt, + VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_LAST, + "default", + "manual", + "auto") + virNetworkObjPtr virNetworkFindByUUID(const virNetworkObjListPtr nets, const unsigned char *uuid) { @@ -1688,6 +1694,7 @@ virNetworkForwardDefParseXML(const char *networkName, char *forwardDev = NULL; char *forwardManaged = NULL; char *forwardDriverName = NULL; + char *forwardDriverGroupMgmt = NULL; char *type = NULL; xmlNodePtr save = ctxt->node; @@ -1725,6 +1732,21 @@ virNetworkForwardDefParseXML(const char *networkName, def->driverName = driverName; } + forwardDriverGroupMgmt = virXPathString("string(./driver/@group)", ctxt); + if (forwardDriverGroupMgmt) { + int driverGroupMgmt + = virNetworkForwardDriverGroupMgmtTypeFromString(forwardDriverGroupMgmt); + + if (driverGroupMgmt <= 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unknown forward <driver group='%s'/> " + "in network %s"), + forwardDriverGroupMgmt, networkName); + goto cleanup; + } + def->driverGroupMgmt = driverGroupMgmt; + } + /* bridge and hostdev modes can use a pool of physical interfaces */ nForwardIfs = virXPathNodeSet("./interface", ctxt, &forwardIfNodes); if (nForwardIfs < 0) { @@ -1907,6 +1929,7 @@ cleanup: VIR_FREE(forwardDev); VIR_FREE(forwardManaged); VIR_FREE(forwardDriverName); + VIR_FREE(forwardDriverGroupMgmt); VIR_FREE(forwardPfNodes); VIR_FREE(forwardIfNodes); VIR_FREE(forwardAddrNodes); @@ -2619,7 +2642,21 @@ virNetworkDefFormatInternal(virBufferPtr buf, def->forward.driverName); goto error; } - virBufferAsprintf(buf, "<driver name='%s'/>\n", driverName); + virBufferAsprintf(buf, "<driver name='%s'", driverName); + if (def->forward.driverGroupMgmt + != VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_DEFAULT) { + const char *driverGroupMgmt + = virNetworkForwardDriverGroupMgmtTypeToString(def->forward.driverGroupMgmt); + + if (!driverGroupMgmt) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("unexpected hostdev driver group management type %d"), + def->forward.driverGroupMgmt); + goto error; + } + virBufferAsprintf(buf, " group='%s'", driverGroupMgmt); + } + virBufferAddLit(buf, "/>\n"); } if (def->forward.type == VIR_NETWORK_FORWARD_NAT) { if (virNetworkForwardNatDefFormat(buf, &def->forward) < 0) diff --git a/src/conf/network_conf.h b/src/conf/network_conf.h index 43f80d4..3a879f1 100644 --- a/src/conf/network_conf.h +++ b/src/conf/network_conf.h @@ -76,6 +76,19 @@ typedef enum { VIR_ENUM_DECL(virNetworkForwardDriverName) +/* how to manage multiple devices in the same iommu group */ +typedef enum { + VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_DEFAULT, /* default is "manual" */ + /* don't automatically detach all devices in group */ + VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_MANUAL, + /* *do* automatically detach/attach all devices in group */ + VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_AUTO, + + VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_LAST +} virNetworkForwardDriverGroupMgmtType; + +VIR_ENUM_DECL(virNetworkForwardDriverGroupMgmt) + typedef struct _virNetworkDHCPHostDef virNetworkDHCPHostDef; typedef virNetworkDHCPHostDef *virNetworkDHCPHostDefPtr; struct _virNetworkDHCPHostDef { @@ -193,6 +206,7 @@ struct _virNetworkForwardDef { int type; /* One of virNetworkForwardType constants */ bool managed; /* managed attribute for hostdev mode */ int driverName; /* enum virNetworkForwardDriverNameType */ + int driverGroupMgmt; /* enum virNetworkForwardDriverGroupMgmtType */ /* If there are multiple forward devices (i.e. a pool of * interfaces), they will be listed here. diff --git a/tests/networkxml2xmlin/hostdev-pf.xml b/tests/networkxml2xmlin/hostdev-pf.xml index 5b8f598..fd38556 100644 --- a/tests/networkxml2xmlin/hostdev-pf.xml +++ b/tests/networkxml2xmlin/hostdev-pf.xml @@ -2,7 +2,7 @@ <name>hostdev</name> <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> <forward mode='hostdev' managed='yes'> - <driver name='vfio'/> + <driver name='vfio' group='auto'/> <pf dev='eth2'/> </forward> </network> diff --git a/tests/networkxml2xmlout/hostdev-pf.xml b/tests/networkxml2xmlout/hostdev-pf.xml index 5b8f598..fd38556 100644 --- a/tests/networkxml2xmlout/hostdev-pf.xml +++ b/tests/networkxml2xmlout/hostdev-pf.xml @@ -2,7 +2,7 @@ <name>hostdev</name> <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> <forward mode='hostdev' managed='yes'> - <driver name='vfio'/> + <driver name='vfio' group='auto'/> <pf dev='eth2'/> </forward> </network> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml index 8daa53a..453ab2c 100644 --- a/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml +++ b/tests/qemuxml2argvdata/qemuxml2argv-hostdev-vfio.xml @@ -23,7 +23,7 @@ <controller type='ide' index='0'/> <controller type='pci' index='0' model='pci-root'/> <hostdev mode='subsystem' type='pci' managed='yes'> - <driver name='vfio'/> + <driver name='vfio' group='auto'/> <source> <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> </source> diff --git a/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml b/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml index 90419a4..3b8ff14 100644 --- a/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml +++ b/tests/qemuxml2argvdata/qemuxml2argv-net-hostdev-vfio.xml @@ -24,7 +24,7 @@ <controller type='pci' index='0' model='pci-root'/> <interface type='hostdev' managed='yes'> <mac address='00:11:22:33:44:55'/> - <driver name='vfio'/> + <driver name='vfio' group='auto'/> <source> <address type='pci' domain='0x0002' bus='0x03' slot='0x07' function='0x1'/> </source> -- 1.7.11.7

This connects the new <driver> group attribute into qemu in two places: 1) when a hostdev is allocated from a libvirt network, the setting of groupMgmt is transferred from the network definition into the new HostdevDef. 2) When the virPCIDeviceList of all hostdevs for the domain is created, each device's attach_group bool is set according to the value of groupMgmt in the HostdevDef. This is where virPCIDeviceDetach (and later virPCIDeviceReAttach) will find it. The effect of setting group='auto': If there are multiple devices in the same iommu group as the device you wish to assign to a guest using VFIO, you have two choices: a) Manually (with "virsh nodedev-detach --driver vfio" or by other means) detach each device in the group from the host and bind them all to either the vfio-pci driver or the pci-stub driver. Then define the <hostdev> (or <interface type='hostdev'>) in the guest config *without* "managed='yes'" - libvirt will expect that the device (and any other devices in the same group) will already be detached from the host and attached to a different driver appropriate for assigning to the guest. b) If you wish to use <hostdev managed='yes'>, then add "group='auto'" to the device's <driver> element: <driver name='vfio' group='auto'/> When it is time to assign the device to the guest, libvirt will check through the group list, and any device in the group not already bound to vfio-pci, pci-stub, or pcieport, will be detached from the host and bound to vfio-pci. Option (1) provides finer grained control over what driver is used for each device in a group, while option (2) is simpler. Note that if you actually want to assign all the devices in the group to a single guest, you can specify "group='auto'" for all of them, and libvirt will pay attention to which devices were already detached (and not attempt to re-detach them, which is just a waste of time), and also will not attempt to re-attach any of the devices in the group to the host until *all* of them have been detached from the guest and are free (and at that time all of them will be re-attached to the host at the same time). --- src/network/bridge_driver.c | 22 ++++++++++++++++++++++ src/qemu/qemu_hostdev.c | 8 ++++++++ 2 files changed, 30 insertions(+) diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index f7c2470..fbe0098 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -3945,6 +3945,7 @@ networkAllocateActualDevice(virDomainNetDefPtr iface) } else if (netdef->forward.type == VIR_NETWORK_FORWARD_HOSTDEV) { virDomainHostdevSubsysPciBackendType backend; + virDomainHostdevSubsysPciGroupMgmtType groupMgmt; if (!iface->data.network.actual && (VIR_ALLOC(iface->data.network.actual) < 0)) { @@ -4001,6 +4002,27 @@ networkAllocateActualDevice(virDomainNetDefPtr iface) iface->data.network.actual->data.hostdev.def.source.subsys.u.pci.backend = backend; + switch (netdef->forward.driverGroupMgmt) + { + case VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_DEFAULT: + groupMgmt = VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_DEFAULT; + break; + case VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_MANUAL: + groupMgmt = VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_MANUAL; + break; + case VIR_NETWORK_FORWARD_DRIVER_GROUP_MGMT_AUTO: + groupMgmt = VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_AUTO; + break; + default: + virReportError(VIR_ERR_INTERNAL_ERROR, + _("unrecognized driver group management value %d " + " in network '%s'"), + netdef->forward.driverGroupMgmt, netdef->name); + goto error; + } + iface->data.network.actual->data.hostdev.def.source.subsys.u.pci.groupMgmt + = groupMgmt; + /* merge virtualports from interface, network, and portgroup to * arrive at actual virtualport to use */ diff --git a/src/qemu/qemu_hostdev.c b/src/qemu/qemu_hostdev.c index 1abad59..2b0221d 100644 --- a/src/qemu/qemu_hostdev.c +++ b/src/qemu/qemu_hostdev.c @@ -74,6 +74,10 @@ qemuGetPciHostDeviceList(virDomainHostdevDefPtr *hostdevs, int nhostdevs) virObjectUnref(list); return NULL; } + if (hostdev->source.subsys.u.pci.groupMgmt + == VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_AUTO) { + virPCIDeviceSetAttachGroup(dev, true); + } } else { if (virPCIDeviceSetStubDriver(dev, "pci-stub") < 0) { virObjectUnref(list); @@ -167,6 +171,10 @@ int qemuUpdateActivePciHostdevs(virQEMUDriverPtr driver, == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) { if (virPCIDeviceSetStubDriver(dev, "vfio-pci") < 0) goto cleanup; + if (hostdev->source.subsys.u.pci.groupMgmt + == VIR_DOMAIN_HOSTDEV_PCI_GROUP_MGMT_AUTO) { + virPCIDeviceSetAttachGroup(dev, true); + } } else { if (virPCIDeviceSetStubDriver(dev, "pci-stub") < 0) goto cleanup; -- 1.7.11.7

On Mon, Jun 24, 2013 at 05:54:49AM -0400, Laine Stump wrote:
When I first put in support for VFIO device assignment, I didn't realize that groups of devices were quite as common as they actually are. In particular, I didn't know that often multiple seemingly-unrelated devices can end up in the same VFIO iommu group due to unlucky circumstances of hardware - they may share a dma controller which means that the devices can't truly be isolated from each other, and thus should not be simultaneously assigned to different guests (or even used by the host) - all of the devices in a group should be either assigned to the same guest or, if not assigned to the guest, should be isolated off in a driver to prevent them from being used by the host.
The following set of patches makes setting that up easier to deal with. The end result of all the patches is the following:
1) The virNodeDevice API will be able to detach or re-attach all the devices in a particular group with a single API call.
2) <hostdev managed='yes'>, <interface type='hostdev' managed='yes'>, and <interface type='network' managed='yes'> devices (where the network is itself a pool of SRIOV Virtual Functions) can specify:
<driver name='vfio' group='auto'/>
and libvirt will automatically detach (and bind to the 'vfio-pci' driver for assignment/isolation) all devices in the same group as the device being assigned. Likewise, when the device it detached from the guest, a check will be made and, if none of the devices in the same group as the device being detach is still in use by a guest
I am concerned that group='auto' is a really incredibly dangerous setting from the POV of operation of the host OS. I can just imagine forum postings / docs saying to use group=auto and people blindly following it without much inclination as to what will happen. They will be trying to assign a spare NIC to their guest; they'll get an error saying it can't be done since it is part of a group; they'll search google and find a recommendation to use group=auto to "fix" the problem. libvirt will see that their SATA controller & graphics card are part of the same group as the NIC and automatically detach them both from the host OS. Kaboom, the user is screwed. With traditional configs, even with managed=yes, you could be sure that only the single device in the XML would ever be touched. If there was a conflict due to other devices being on the same PCI bridge without FLR, then the device would safely fail to be assigned until the user had explicitly disconnected other devices from the host. We never attempted to automatically disconnect anything that was not part of the XML Following on from that, how does an application determine what other devices are present in the group associated with the device being assigned ? Are we exposing group membership info in the node device XML anywhere ? I'm not sure what else to suggest, other than to say we should not add this attribute, and require that the application/user explicitly disconnect any other devices in the same group from the host OS. Any other option I can think of just sounds too dangerous. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 06/24/2013 06:06 AM, Daniel P. Berrange wrote:
On Mon, Jun 24, 2013 at 05:54:49AM -0400, Laine Stump wrote:
When I first put in support for VFIO device assignment, I didn't realize that groups of devices were quite as common as they actually are. In particular, I didn't know that often multiple seemingly-unrelated devices can end up in the same VFIO iommu group due to unlucky circumstances of hardware - they may share a dma controller which means that the devices can't truly be isolated from each other, and thus should not be simultaneously assigned to different guests (or even used by the host) - all of the devices in a group should be either assigned to the same guest or, if not assigned to the guest, should be isolated off in a driver to prevent them from being used by the host.
The following set of patches makes setting that up easier to deal with. The end result of all the patches is the following:
1) The virNodeDevice API will be able to detach or re-attach all the devices in a particular group with a single API call.
2) <hostdev managed='yes'>, <interface type='hostdev' managed='yes'>, and <interface type='network' managed='yes'> devices (where the network is itself a pool of SRIOV Virtual Functions) can specify:
<driver name='vfio' group='auto'/>
and libvirt will automatically detach (and bind to the 'vfio-pci' driver for assignment/isolation) all devices in the same group as the device being assigned. Likewise, when the device it detached from the guest, a check will be made and, if none of the devices in the same group as the device being detach is still in use by a guest I am concerned that group='auto' is a really incredibly dangerous setting from the POV of operation of the host OS.
I can just imagine forum postings / docs saying to use group=auto and people blindly following it without much inclination as to what will happen. They will be trying to assign a spare NIC to their guest; they'll get an error saying it can't be done since it is part of a group; they'll search google and find a recommendation to use group=auto to "fix" the problem. libvirt will see that their SATA controller & graphics card are part of the same group as the NIC and automatically detach them both from the host OS. Kaboom, the user is screwed.
Yes, I understand (and share) your concern. These patches grew out of claims that there would be a regression in behavior if "managed='yes'" stopped working properly with devices that were in a group, and this was the only way I could see to make it work.
With traditional configs, even with managed=yes, you could be sure that only the single device in the XML would ever be touched. If there was a conflict due to other devices being on the same PCI bridge without FLR, then the device would safely fail to be assigned until the user had explicitly disconnected other devices from the host. We never attempted to automatically disconnect anything that was not part of the XML
Following on from that, how does an application determine what other devices are present in the group associated with the device being assigned ? Are we exposing group membership info in the node device XML anywhere ?
Yes. nodedev-dumpxml now has the following information for every device that's in a group (this is added in Patch 20/22): <iommuGroup number='12'> <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> <address domain='0x0000' bus='0x02' slot='0x00' function='0x1'/> </iommuGroup>
I'm not sure what else to suggest, other than to say we should not add this attribute, and require that the application/user explicitly disconnect any other devices in the same group from the host OS. Any other option I can think of just sounds too dangerous.
I would suggest adding some sort of "assignment white list" to libvirt's config, and requiring any device manipulated in any manner to be on that white list, but in a way it's already possible to create such a list - just manually detach all devices that will be assigned to a guest (and any devices in the same iommu groups) and stop using managed='yes'. (Just doing that was my original plan, but it had opposition due to the perceived regression in behavior.) Of course none of this device group management is required for what I see (maybe mistakenly so?) as the most common use of PCI device assignment - SRIOV Virtual Functions; each VF is always by itself in a group, so managed assignment will work for them with no problem. It's the odd devices integrated on the motherboard chipset, and cards plugged into certain combinations of slots, that end up being in groups. So what is usable/useful from this patchset? I think the addition of group information to nodedev-dumpxml is useful, and possibly the virNodeDeviceReAttachFlags() should be added in case we later come up with a different solution that requires passing a flag to virNodeDeviceReAttach() and need to backport it. And of course there are many patches in here that can be considered cleanups/bugfixes (4 is a bugfix, everything else up to 10 is a general cleanup, and 11 has utility functions useful for 20 (which adds the group information to nodedev-dumpxml). Really, everything except the patches marked with * are safe/useful: 01 syntax: virPCIDeviceFree is also a NOP for NULL args 02 pci: change stubDriver from const char* to char* 03 pci: new utility functions 04 pci: eliminate memory leak in virPCIDeviceReattach 05 pci: make virPCIDeviceDetach consistent in behavior 06 pci: eliminate repetitive path constructions in virPCIDeviceBindToStub 07 pci: eliminate unused driver arg from virPCIDeviceDetach 08 pci: update stubDriver name in virPCIDeviceBindToStub 09 pci: rename virPCIDeviceGetVFIOGroupDev to virPCIDeviceGetIOMMUGroupDev 10 pci: make virPCIParseDeviceAddress public 11 pci: new iommu_group functions * 12 pci: optionally detach/reattach all devices in a VFIO group * 13 API & qemu: add ability to detach an entire VFIO group of devices * 14 virsh: add option to detach entire group of devices * 15 API: new virNodeDeviceReAttachFlags * 16 API: implement RPC calls for virNodeDeviceReAttachFlags * 17 qemu: implement virNodeDeviceReAttachFlags * 18 xen: implement virNodeDeviceReAttachFlags * 19 virsh: add option to attach entire group of devices 20 nodedev: add iommuGroup to node device object * 21 conf: add <driver group='auto'> to hostdev, interface, and networks * 22 qemu: implement backend of <driver group='auto'/> Or does someone see a reasonable way to make some modification of the existing stuff less dangerous? Maybe a simple whitelist could be workable (but where should it go?) Or is there any way to determine which devices can safely be detached from their host drivers without disrupting the system? Alex, any ideas?

On Mon, Jun 24, 2013 at 10:40:48AM -0400, Laine Stump wrote:
On 06/24/2013 06:06 AM, Daniel P. Berrange wrote:
On Mon, Jun 24, 2013 at 05:54:49AM -0400, Laine Stump wrote:
When I first put in support for VFIO device assignment, I didn't realize that groups of devices were quite as common as they actually are. In particular, I didn't know that often multiple seemingly-unrelated devices can end up in the same VFIO iommu group due to unlucky circumstances of hardware - they may share a dma controller which means that the devices can't truly be isolated from each other, and thus should not be simultaneously assigned to different guests (or even used by the host) - all of the devices in a group should be either assigned to the same guest or, if not assigned to the guest, should be isolated off in a driver to prevent them from being used by the host.
The following set of patches makes setting that up easier to deal with. The end result of all the patches is the following:
1) The virNodeDevice API will be able to detach or re-attach all the devices in a particular group with a single API call.
2) <hostdev managed='yes'>, <interface type='hostdev' managed='yes'>, and <interface type='network' managed='yes'> devices (where the network is itself a pool of SRIOV Virtual Functions) can specify:
<driver name='vfio' group='auto'/>
and libvirt will automatically detach (and bind to the 'vfio-pci' driver for assignment/isolation) all devices in the same group as the device being assigned. Likewise, when the device it detached from the guest, a check will be made and, if none of the devices in the same group as the device being detach is still in use by a guest I am concerned that group='auto' is a really incredibly dangerous setting from the POV of operation of the host OS.
I can just imagine forum postings / docs saying to use group=auto and people blindly following it without much inclination as to what will happen. They will be trying to assign a spare NIC to their guest; they'll get an error saying it can't be done since it is part of a group; they'll search google and find a recommendation to use group=auto to "fix" the problem. libvirt will see that their SATA controller & graphics card are part of the same group as the NIC and automatically detach them both from the host OS. Kaboom, the user is screwed.
Yes, I understand (and share) your concern. These patches grew out of claims that there would be a regression in behavior if "managed='yes'" stopped working properly with devices that were in a group, and this was the only way I could see to make it work.
IIUC, 2 devices are in the same group if there is no way to assign them to different domains at the same time, eg due to lack of FLR. If that is true, then old non-VFIO would have refused to start the guest if you had assigned only one of the devices to the guest, even with managed=yes. So it doesn't sound like a regression to me - both with old style pci assign and with vfio, you would be required to either assign all the devices to the same guest, or manually unbind the non-assigned devices from the host to allow the guest to start.
With traditional configs, even with managed=yes, you could be sure that only the single device in the XML would ever be touched. If there was a conflict due to other devices being on the same PCI bridge without FLR, then the device would safely fail to be assigned until the user had explicitly disconnected other devices from the host. We never attempted to automatically disconnect anything that was not part of the XML
Following on from that, how does an application determine what other devices are present in the group associated with the device being assigned ? Are we exposing group membership info in the node device XML anywhere ?
Yes. nodedev-dumpxml now has the following information for every device that's in a group (this is added in Patch 20/22):
<iommuGroup number='12'> <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> <address domain='0x0000' bus='0x02' slot='0x00' function='0x1'/> </iommuGroup>
I'm not sure what else to suggest, other than to say we should not add this attribute, and require that the application/user explicitly disconnect any other devices in the same group from the host OS. Any other option I can think of just sounds too dangerous.
I would suggest adding some sort of "assignment white list" to libvirt's config, and requiring any device manipulated in any manner to be on that white list, but in a way it's already possible to create such a list - just manually detach all devices that will be assigned to a guest (and any devices in the same iommu groups) and stop using managed='yes'.
Yes, as you say, you can already setup an "assignment whitelist" simply by unbinding all allowed devices from the host & using managed=no.
(Just doing that was my original plan, but it had opposition due to the perceived regression in behavior.)
As above I don't see any regression in behaviour - non-VFIO case would not silently detach any devices not explicitly listed in the XML. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (3)
-
Daniel P. Berrange
-
Jiri Denemark
-
Laine Stump