[libvirt] [PATCH 0/4] PCI multifunction partial assignment support

(--- long post warning ---) This is a work that derived from the discussions I had with Laine Stump and Alex Williamson in [1]. I'll provide a quick gist below. ---------- Today, Libvirt does not have proper support for partial assignment of functions of passed-through PCI multifunction devices (hostdev with VFIO-PCI). By partial assignment I mean the guest being able to use just some, not all, virtual functions of the device. Even if the functions itself became useless in the host, the some functions might not be safe to be used by the guest, thus the user should be able to limit it. I mentioned 'proper' because today it is possible to get this done in Libvirt if we use 'managed=no' in the hostdevs. If the user makes the proper setup (i.e. detaching all IOMMU devices), and use managed='no', Libvirt will launch the guest just with the functions declared in the XML. The technical reason for this is simple: in virHostdevPreparePCIDevices() we do not take into account that multifunction PCI devices requires the whole IOMMU to be detached, not just the devices being declared in def->hostdevs. In this case, managed='yes' will not work in this scenario, causing errors in QEMU launch. The discussion I've started in [1] was motivated by my attempt of automatically detaching the IOMMU inside the prepare function with managed='yes' devices. Laine discarded this idea, arguing that the concept of partial assignment will cause user confusion if Libvirt starts to handle things without the user being fully aware. In [1] it was discussed the possibility of declaring the functions that won't be assigned to the guest in the XML, forcing the user to be aware that these functions will be lost in the host, as a possible approach for a solution. ----------- These series tries to solve the partial assignment of multifunction hostdev PCI devices by introducing a new hostdev attribute called 'assigned'. This is how it works: - it is a boolean value that will be efffective just for multifunction hostdev PCI devices, since there's no other occurrence for this kind of use in Libvirt. Trying to declare assign='yes|no' in any other PCI hostdev device will cause parse errors; - default value if the attribute is not present is 'assigned=yes'; - <address> element will be forbidden if the hostdev is declared with assigned='no'. This is to make more evident to the user that this is a function that the guest will NOT be using, with a bonus that we will not need to calculate an address that won't be used; - with this new mechanic, all the functions of the PCI multifunction device must be declared in the XML, even if they're not going to be assigned to the guest. This is being enforced inside the Prepare() function to avoid XML parsing errors and existing domains being erased. For consistency, this is being enforced regardless of the 'managed' flag - even the managed='no' case is going to adhere with this restriction. This is a XML snippet of this new attribute, applied to a hostdev passthrough of a multifunction Broadcom BCM5719 PCIe card with 4 functions: <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes' assigned='no'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x1'/> </source> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes' assigned='no'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x2'/> </source> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x3'/> </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x3'/> </hostdev> lspci inside the guest: $ lspci 0000:00:01.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 0000:00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01) 0000:00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 0001:00:01.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) 0001:00:01.3 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) $ The intent of this design is to forbid a hostdev with 'assigned=no' to be attached to QEMU, to have an alias and to have a guest address. For all other purposes, including the processing done in virHostdevPreparePCIDevices(), the device is like any other hostdev - it will be added to activePCIHostdevs and, most important, it will be detached from the host IOMMU during guest start-up and will be re-attach back to the host with guest shutdown. In this very first interaction of this idea I most definitely missed/forgot to consider something, so any feedback is appreciated. This has direct implications into the multifunction pci hotplug/hot-unplug work (in fact, it was the hotplug work that revealed this gap in Libvirt support - I first noticed it seeing failed partial hotplug scenarios) and I would like to see it figured out, with this or any other approach, before continuing with the hotplug work. Thanks, DHB [1] https://www.redhat.com/archives/libvir-list/2019-July/msg01175.html Daniel Henrique Barboza (4): utils: introducing PCI multifunction detection helpers Introducing assigned='yes|no' hostdev attribute qemu_command: ignore non-assigned PCI multifunction hostdevs virhostdev: all functions of a PCI multifunction dev must be added docs/schemas/domaincommon.rng | 5 + src/conf/domain_conf.c | 36 ++++++- src/conf/domain_conf.h | 1 + src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 8 ++ src/qemu/qemu_command.c | 5 + src/qemu/qemu_domain_address.c | 7 ++ src/util/virhostdev.c | 88 +++++++++++++++++- src/util/virhostdev.h | 3 + src/util/virpci.c | 15 +++ src/util/virpci.h | 2 + ...ostdev-pci-multifunction-partial-fail.args | 31 ++++++ ...hostdev-pci-multifunction-partial-fail.xml | 35 +++++++ .../hostdev-pci-multifunction-partial.args | 31 ++++++ .../hostdev-pci-multifunction-partial.xml | 41 ++++++++ .../hostdev-pci-multifunction.args | 7 +- .../hostdev-pci-multifunction.xml | 6 ++ tests/qemuxml2argvtest.c | 8 ++ .../hostdev-pci-multifunction-partial.xml | 57 ++++++++++++ .../hostdev-pci-multifunction.xml | 23 +++-- .../qemuxml2xmloutdata/pseries-hostdevs-1.xml | 4 +- .../qemuxml2xmloutdata/pseries-hostdevs-2.xml | 4 +- .../qemuxml2xmloutdata/pseries-hostdevs-3.xml | 4 +- tests/qemuxml2xmltest.c | 1 + tests/virpcitestdata/0005-90-01.1.config | Bin 256 -> 256 bytes tests/virpcitestdata/0005-90-01.2.config | Bin 256 -> 256 bytes tests/virpcitestdata/0005-90-01.3.config | Bin 0 -> 256 bytes tools/virsh-domain.c | 5 + 28 files changed, 408 insertions(+), 21 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.xml create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-multifunction-partial.xml create mode 100644 tests/virpcitestdata/0005-90-01.3.config -- 2.21.0

This patch introduces two helpers that will be used in the next patches. One is virpci.c:virPCIDeviceIsMultifunction(), and the other is virhostdev.c:virHostdevIsPCIMultifunctionDevice(). As the name suggests, the idea is to detect if a given dev/hostdev is a PCI multifunction device. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/libvirt_private.syms | 2 ++ src/util/virhostdev.c | 24 ++++++++++++++++++++++++ src/util/virhostdev.h | 3 +++ src/util/virpci.c | 15 +++++++++++++++ src/util/virpci.h | 2 ++ 5 files changed, 46 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index eeab820eca..1c611ea8f4 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2106,6 +2106,7 @@ virHostCPUStatsAssign; # util/virhostdev.h virHostdevFindUSBDevice; virHostdevIsMdevDevice; +virHostdevIsPCIMultifunctionDevice; virHostdevIsSCSIDevice; virHostdevManagerGetDefault; virHostdevPCINodeDeviceDetach; @@ -2684,6 +2685,7 @@ virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; virPCIDeviceHasPCIExpressLink; virPCIDeviceIsAssignable; +virPCIDeviceIsMultifunction; virPCIDeviceIsPCIExpress; virPCIDeviceListAdd; virPCIDeviceListAddCopy; diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 41fcab7222..1aa8e9729d 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -2191,3 +2191,27 @@ virHostdevUpdateActiveDomainDevices(virHostdevManagerPtr mgr, return 0; } + +bool +virHostdevIsPCIMultifunctionDevice(virDomainHostdevDefPtr hostdev) +{ + VIR_AUTOPTR(virPCIDevice) pciDev = NULL; + virDomainHostdevSubsysPCIPtr pcisrc = &hostdev->source.subsys.u.pci; + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || + hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) + return false; + + /* Libvirt should be able to perform all the operations in + * virPCIDeviceNew() even if it's running unprivileged, so if this + * fails, the device apparently doesn't currently exist on the host. + * Since we can't speculate, assume this device is not multifunction. + */ + pciDev = virPCIDeviceNew(pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); + + if (!pciDev) + return false; + + return virPCIDeviceIsMultifunction(pciDev); +} diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h index 88501e2743..63a1d7367b 100644 --- a/src/util/virhostdev.h +++ b/src/util/virhostdev.h @@ -190,6 +190,9 @@ virHostdevIsSCSIDevice(virDomainHostdevDefPtr hostdev) bool virHostdevIsMdevDevice(virDomainHostdevDefPtr hostdev) ATTRIBUTE_NONNULL(1); +bool +virHostdevIsPCIMultifunctionDevice(virDomainHostdevDefPtr hostdev) + ATTRIBUTE_NONNULL(1); /* functions used by NodeDevDetach/Reattach/Reset */ int virHostdevPCINodeDeviceDetach(virHostdevManagerPtr mgr, diff --git a/src/util/virpci.c b/src/util/virpci.c index ee78151e74..778e47ea08 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2863,6 +2863,21 @@ int virPCIGetHeaderType(virPCIDevicePtr dev, int *hdrType) return 0; } +bool +virPCIDeviceIsMultifunction(virPCIDevicePtr dev) +{ + int fd; + uint8_t type; + + if ((fd = virPCIDeviceConfigOpen(dev)) < 0) + return -1; + + type = virPCIDeviceRead8(dev, fd, PCI_HEADER_TYPE); + + virPCIDeviceConfigClose(dev, fd); + + return type & PCI_HEADER_TYPE_MULTI; +} void virPCIEDeviceInfoFree(virPCIEDeviceInfoPtr dev) diff --git a/src/util/virpci.h b/src/util/virpci.h index dc20f91710..7199882d6b 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -264,6 +264,8 @@ int virPCIDeviceGetLinkCapSta(virPCIDevicePtr dev, int virPCIGetHeaderType(virPCIDevicePtr dev, int *hdrType); +bool virPCIDeviceIsMultifunction(virPCIDevicePtr dev); + void virPCIEDeviceInfoFree(virPCIEDeviceInfoPtr dev); ssize_t virPCIGetMdevTypes(const char *sysfspath, -- 2.21.0

The idea of having an attribute that declares that a hostdev is not to be assigned to the guest seems counterintuitive, but it has a niche use with multifunction PCI devices. The current use of this kind of device in Libvirt is to declare all of the functions in the XML. This is not ideal though - some devices might have functions that are security sensitive to be shared to guests, but at the same time the user might want to passthrough the other functions to a guest. This is what we call 'partial assignment'. Libvirt does not have proper support for this scenario: when using managed='yes', this partial assignment will cause the code inside virhostdev.c to not detach all the IOMMU devices - given that we do not want all of them in the guest - and then we face a QEMU error because we didn't detached the whole IOMMU. An idea was discussed in [1] where Libvirt would automatically detach all the IOMMU devices in case any multifunction PCI hostdev is used in the guest, but this idea was discarded because removing user agency in this case is undesirable. The user must be aware of all the functions that are going to be detached from the host, even if they're not being assigned to the guest, to avoid scenarios in which Libvirt "all of a sudden" detaches something that the user didn't want to. This patch implements a new attribute called 'assigned' that indicates whether a hostdev is going to be assigned to the guest or not. To keep its use less intrusive for every other hostdev that does not need such control, the attribute can only be set by PCI multifunction devices. For existing domains prior to this change, assign='yes' will be implied. If the user decides not to assign a specific function to the guest using assign='no', the <address> field becomes unavailable and throws a parsing error if used. In the next patch we'll use this attribute to avoid assigning an assigned='no' device to QEMU at boot. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- Note: the changes in the .config files were done because the existing .config data files weren't being recognized as multifunction PCI devices after we started reading the PCI header. docs/schemas/domaincommon.rng | 5 +++ src/conf/domain_conf.c | 36 +++++++++++++++++- src/conf/domain_conf.h | 1 + .../hostdev-pci-multifunction.args | 7 ++-- .../hostdev-pci-multifunction.xml | 6 +++ .../hostdev-pci-multifunction.xml | 23 +++++++---- .../qemuxml2xmloutdata/pseries-hostdevs-1.xml | 4 +- .../qemuxml2xmloutdata/pseries-hostdevs-2.xml | 4 +- .../qemuxml2xmloutdata/pseries-hostdevs-3.xml | 4 +- tests/virpcitestdata/0005-90-01.1.config | Bin 256 -> 256 bytes tests/virpcitestdata/0005-90-01.2.config | Bin 256 -> 256 bytes tests/virpcitestdata/0005-90-01.3.config | Bin 0 -> 256 bytes tools/virsh-domain.c | 5 +++ 13 files changed, 76 insertions(+), 19 deletions(-) create mode 100644 tests/virpcitestdata/0005-90-01.3.config diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 40eb4a2d75..d1a5c051e3 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -4629,6 +4629,11 @@ <attribute name="type"> <value>pci</value> </attribute> + <optional> + <attribute name="assigned"> + <ref name="virYesNo"/> + </attribute> + </optional> <interleave> <optional> <element name="driver"> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index a53cd6a725..8d80824a0a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -7755,7 +7755,6 @@ virDomainHostdevSubsysPCIDefParseXML(xmlNodePtr node, if (virXMLNodeNameEqual(cur, "address")) { virPCIDeviceAddressPtr addr = &def->source.subsys.u.pci.addr; - if (virPCIDeviceAddressParseXML(cur, addr) < 0) goto out; } else if ((flags & VIR_DOMAIN_DEF_PARSE_STATUS) && @@ -8147,7 +8146,7 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, virDomainHostdevDefPtr def, unsigned int flags) { - xmlNodePtr sourcenode; + xmlNodePtr sourcenode, addressnode; int backend; virDomainHostdevSubsysPCIPtr pcisrc = &def->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &def->source.subsys.u.scsi; @@ -8159,6 +8158,8 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, VIR_AUTOFREE(char *) backendStr = NULL; VIR_AUTOFREE(char *) model = NULL; VIR_AUTOFREE(char *) display = NULL; + VIR_AUTOFREE(char *) assigned = NULL; + /* @managed can be read from the xml document - it is always an * attribute of the toplevel element, no matter what type of @@ -8289,6 +8290,32 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, if (virDomainHostdevSubsysPCIDefParseXML(sourcenode, def, flags) < 0) return -1; + /* @assigned can only be set for multifunction PCI devices. + * In case the attribute is missing, always assume + * assigned = true. + */ + def->assigned = true; + + if ((assigned = virXMLPropString(node, "assigned")) != NULL) { + if (!virHostdevIsPCIMultifunctionDevice(def)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("'assigned' can only be set for multifunction " + "PCI devices")); + return -1; + } + + if (STREQ(assigned, "no")) + def->assigned = false; + + if ((addressnode = virXPathNode("./address", ctxt)) && + !def->assigned) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("unable to set <address> element " + "for assign='no' hostdev")); + return -1; + } + } + backend = VIR_DOMAIN_HOSTDEV_PCI_BACKEND_DEFAULT; if ((backendStr = virXPathString("string(./driver/@name)", ctxt)) && (((backend = virDomainHostdevSubsysPCIBackendTypeFromString(backendStr)) < 0) || @@ -27254,6 +27281,11 @@ virDomainHostdevDefFormat(virBufferPtr buf, virBufferAsprintf(buf, " managed='%s'", def->managed ? "yes" : "no"); + if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + virHostdevIsPCIMultifunctionDevice(def)) + virBufferAsprintf(buf, " assigned='%s'", + def->assigned ? "yes" : "no"); + if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI && scsisrc->sgio) virBufferAsprintf(buf, " sgio='%s'", diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 2884af49d8..e2f6f640b9 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -343,6 +343,7 @@ struct _virDomainHostdevDef { bool missing; bool readonly; bool shareable; + bool assigned; union { virDomainHostdevSubsys subsys; virDomainHostdevCaps caps; diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction.args b/tests/qemuxml2argvdata/hostdev-pci-multifunction.args index d8690c010b..3cef177c00 100644 --- a/tests/qemuxml2argvdata/hostdev-pci-multifunction.args +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction.args @@ -30,6 +30,7 @@ server,nowait \ -device vfio-pci,host=0001:01:00.0,id=hostdev2,bus=pci.0,addr=0x5 \ -device vfio-pci,host=0005:90:01.2,id=hostdev3,bus=pci.0,addr=0x6 \ -device vfio-pci,host=0005:90:01.3,id=hostdev4,bus=pci.0,addr=0x7 \ --device vfio-pci,host=0000:06:12.1,id=hostdev5,bus=pci.0,addr=0x8 \ --device vfio-pci,host=0000:06:12.2,id=hostdev6,bus=pci.0,addr=0x9 \ --device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xa +-device vfio-pci,host=0000:06:12.0,id=hostdev5,bus=pci.0,addr=0x8 \ +-device vfio-pci,host=0000:06:12.1,id=hostdev6,bus=pci.0,addr=0x9 \ +-device vfio-pci,host=0000:06:12.2,id=hostdev7,bus=pci.0,addr=0xa \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xb diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction.xml b/tests/qemuxml2argvdata/hostdev-pci-multifunction.xml index 06c889c64d..f4813961b3 100644 --- a/tests/qemuxml2argvdata/hostdev-pci-multifunction.xml +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction.xml @@ -43,6 +43,12 @@ <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> </source> </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x0'/> + </source> + </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> diff --git a/tests/qemuxml2xmloutdata/hostdev-pci-multifunction.xml b/tests/qemuxml2xmloutdata/hostdev-pci-multifunction.xml index 52ed86e305..079e0513c1 100644 --- a/tests/qemuxml2xmloutdata/hostdev-pci-multifunction.xml +++ b/tests/qemuxml2xmloutdata/hostdev-pci-multifunction.xml @@ -23,35 +23,35 @@ <controller type='pci' index='0' model='pci-root'/> <input type='mouse' bus='ps2'/> <input type='keyboard' bus='ps2'/> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </hostdev> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x01' slot='0x00' function='0x1'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </hostdev> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </hostdev> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </hostdev> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> @@ -61,19 +61,26 @@ <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> - <address domain='0x0000' bus='0x06' slot='0x12' function='0x1'/> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> - <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x1'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/> </hostdev> - <memballoon model='virtio'> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> + </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/> + </hostdev> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/> </memballoon> </devices> </domain> diff --git a/tests/qemuxml2xmloutdata/pseries-hostdevs-1.xml b/tests/qemuxml2xmloutdata/pseries-hostdevs-1.xml index e77a060a38..94472b38e4 100644 --- a/tests/qemuxml2xmloutdata/pseries-hostdevs-1.xml +++ b/tests/qemuxml2xmloutdata/pseries-hostdevs-1.xml @@ -35,14 +35,14 @@ </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> </interface> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </hostdev> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x01' slot='0x00' function='0x1'/> diff --git a/tests/qemuxml2xmloutdata/pseries-hostdevs-2.xml b/tests/qemuxml2xmloutdata/pseries-hostdevs-2.xml index cfa395b001..f2fe94d2fb 100644 --- a/tests/qemuxml2xmloutdata/pseries-hostdevs-2.xml +++ b/tests/qemuxml2xmloutdata/pseries-hostdevs-2.xml @@ -30,14 +30,14 @@ <model name='spapr-pci-host-bridge'/> <target index='2'/> </controller> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x02' function='0x0'/> </hostdev> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> diff --git a/tests/qemuxml2xmloutdata/pseries-hostdevs-3.xml b/tests/qemuxml2xmloutdata/pseries-hostdevs-3.xml index f91959b805..0893ecd887 100644 --- a/tests/qemuxml2xmloutdata/pseries-hostdevs-3.xml +++ b/tests/qemuxml2xmloutdata/pseries-hostdevs-3.xml @@ -27,14 +27,14 @@ <model name='spapr-pci-host-bridge'/> <target index='2'/> </controller> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </hostdev> - <hostdev mode='subsystem' type='pci' managed='yes'> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x01' slot='0x00' function='0x1'/> diff --git a/tests/virpcitestdata/0005-90-01.1.config b/tests/virpcitestdata/0005-90-01.1.config index beee76534041a7020c08ae9ac03d9a349c6ea12e..a60599bd342d3ebcdc7b8367ca36ad337f602fde 100644 GIT binary patch delta 44 ycmZo*YG4vE7BFRCV-R3+7GUOK;Amg~f~JXq5)*X<85t+qEn;Cc-jFjfPzC^<7YL>R delta 39 ucmZo*YG4vE7BFRCV-R3+7GUOK;9y{25MXGU7$`AON05<eqTQm22?_viD+cla diff --git a/tests/virpcitestdata/0005-90-01.2.config b/tests/virpcitestdata/0005-90-01.2.config index cfd72e4d7165bff2751ecbdc570ce7f1e0646226..a60599bd342d3ebcdc7b8367ca36ad337f602fde 100644 GIT binary patch literal 256 zcmXpOFlAt45MXi^VCG@qXkY+>CJ=!Q7z5RUfCHEW5{!&mj0{Y5Fz#TaS&cX3;ByxM DCDjDB literal 256 zcmXpOc)-BMAi%_;z|6zo!oa|wz|aIFu>xbDS`csmlR$!5K#7rosSd`)Mk^@TV-u#E T7_0Gy9FS#<5E~CbC<F-rZiWW} diff --git a/tests/virpcitestdata/0005-90-01.3.config b/tests/virpcitestdata/0005-90-01.3.config new file mode 100644 index 0000000000000000000000000000000000000000..a60599bd342d3ebcdc7b8367ca36ad337f602fde GIT binary patch literal 256 zcmXpOFlAt45MXi^VCG@qXkY+>CJ=!Q7z5RUfCHEW5{!&mj0{Y5Fz#TaS&cX3;ByxM DCDjDB literal 0 HcmV?d00001 diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c index fbfdc09c0d..1453d7f024 100644 --- a/tools/virsh-domain.c +++ b/tools/virsh-domain.c @@ -858,6 +858,11 @@ static const vshCmdOptDef opts_attach_interface[] = { .type = VSH_OT_BOOL, .help = N_("libvirt will automatically detach/attach the device from/to host") }, + {.name = "assigned", + .type = VSH_OT_BOOL, + .help = N_("hostdev virtual function will be assigned to the guest " + "(multifunction PCI device only)") + }, {.name = NULL} }; -- 2.21.0

Bulding up in the new attribute introduced by the previous patch, let's not assign a hostdev to a QEMU domain by avoiding the '-device' line to be created at all inside qemu_command.c. Since the device isn't going to be guest visible, makes little sense to assign alias and address to it, so let's also prevent that. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_alias.c | 8 +++ src/qemu/qemu_command.c | 5 ++ src/qemu/qemu_domain_address.c | 7 +++ .../hostdev-pci-multifunction-partial.args | 31 ++++++++++ .../hostdev-pci-multifunction-partial.xml | 41 +++++++++++++ tests/qemuxml2argvtest.c | 4 ++ .../hostdev-pci-multifunction-partial.xml | 57 +++++++++++++++++++ tests/qemuxml2xmltest.c | 1 + 8 files changed, 154 insertions(+) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.xml create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-multifunction-partial.xml diff --git a/src/qemu/qemu_alias.c b/src/qemu/qemu_alias.c index d294963d35..f1705a9a47 100644 --- a/src/qemu/qemu_alias.c +++ b/src/qemu/qemu_alias.c @@ -602,6 +602,14 @@ qemuAssignDeviceAliases(virDomainDefPtr def, virQEMUCapsPtr qemuCaps) return -1; } for (i = 0; i < def->nhostdevs; i++) { + /* do not assign alias for PCI multifunction hostdevs that + * will not be assigned to the guest. + */ + + if (virHostdevIsPCIMultifunctionDevice(def->hostdevs[i]) && + !def->hostdevs[i]->assigned) + continue; + /* we can't start assigning at 0, since netdevs may have used * up some hostdevN entries already. Also if the HostdevDef is * linked to a NetDef, they will share an info and the alias diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index cbf25d5f07..1e12550e9b 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -5490,6 +5490,11 @@ qemuBuildHostdevCommandLine(virCommandPtr cmd, } } + /* Ignore 'noAssign' devices + */ + if (!hostdev->assigned) + continue; + unsigned int bootIndex = hostdev->info->bootIndex; /* bootNet will be non-0 if boot order was set and no other diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 4f5671f458..71711e22d2 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -2319,6 +2319,13 @@ qemuDomainAssignDevicePCISlots(virDomainDefPtr def, continue; } + /* Do not assign an address to a hostdev that will not + * be assigned to the guest. + */ + if (virHostdevIsPCIMultifunctionDevice(def->hostdevs[i]) && + !def->hostdevs[i]->assigned) + continue; + if (qemuDomainPCIAddressReserveNextAddr(addrs, def->hostdevs[i]->info) < 0) goto error; diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.args b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.args new file mode 100644 index 0000000000..4074968c84 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.args @@ -0,0 +1,31 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-delete \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-delete/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-delete/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-delete/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-x86_64 \ +-name delete \ +-S \ +-machine pc,accel=kvm,usb=off,dump-guest-core=off \ +-m 256 \ +-realtime mlock=off \ +-smp 4,sockets=4,cores=1,threads=1 \ +-uuid 583a8e8e-f0ce-4f53-89ab-092862148b25 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,path=/tmp/lib/domain--1-delete/monitor.sock,\ +server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-usb \ +-device vfio-pci,host=0005:90:01.0,id=hostdev0,bus=pci.0,addr=0x3 \ +-device vfio-pci,host=0005:90:01.1,id=hostdev1,bus=pci.0,addr=0x4 \ +-device vfio-pci,host=0005:90:01.3,id=hostdev2,bus=pci.0,addr=0x5 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.xml b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.xml new file mode 100644 index 0000000000..e87dfcdcff --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial.xml @@ -0,0 +1,41 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='no'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + </hostdev> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index db79301f0e..434e01308c 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1357,6 +1357,10 @@ mymain(void) QEMU_CAPS_KVM, QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("hostdev-pci-multifunction-partial", + QEMU_CAPS_KVM, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("serial-file-log", QEMU_CAPS_CHARDEV_FILE_APPEND, QEMU_CAPS_DEVICE_ISA_SERIAL, diff --git a/tests/qemuxml2xmloutdata/hostdev-pci-multifunction-partial.xml b/tests/qemuxml2xmloutdata/hostdev-pci-multifunction-partial.xml new file mode 100644 index 0000000000..aa8da25ccd --- /dev/null +++ b/tests/qemuxml2xmloutdata/hostdev-pci-multifunction-partial.xml @@ -0,0 +1,57 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <currentMemory unit='KiB'>262144</currentMemory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <controller type='ide' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='no'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes' assigned='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> + </hostdev> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> + </memballoon> + </devices> +</domain> diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index d5c66d8791..5e0fa6af86 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -441,6 +441,7 @@ mymain(void) DO_TEST("hostdev-usb-address", NONE); DO_TEST("hostdev-pci-address", NONE); DO_TEST("hostdev-pci-multifunction", NONE); + DO_TEST("hostdev-pci-multifunction-partial", NONE); DO_TEST("hostdev-vfio", NONE); DO_TEST("hostdev-vfio-zpci", QEMU_CAPS_DEVICE_ZPCI, -- 2.21.0

This is the last piece for a full fledged PCI multifunction partial assignment support. Now that we have a defined XML format and QEMU is ignoring the existence of the hostdev that isn't being assigned, let's enforce that all functions must be declared in the domain XML for such devices. This will make sure that we'll always detach/re-attach the whole IOMMU and will force the user to be aware of what is going to be detached from the host. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/util/virhostdev.c | 64 ++++++++++++++++++- ...ostdev-pci-multifunction-partial-fail.args | 31 +++++++++ ...hostdev-pci-multifunction-partial-fail.xml | 35 ++++++++++ tests/qemuxml2argvtest.c | 4 ++ 4 files changed, 132 insertions(+), 2 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 1aa8e9729d..a85860b34f 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -56,6 +56,13 @@ struct virHostdevIsPCINodeDeviceUsedData { bool usesVFIO; }; +struct virHostdevIsAllIOMMUGroupUsedData { + virPCIDeviceListPtr pcidevs; + const char *domainName; + const char *deviceName; +}; + + /* This module makes heavy use of bookkeeping lists contained inside a * virHostdevManager instance to keep track of the devices' status. To make * it easy to spot potential ownership errors when moving devices from one @@ -114,6 +121,26 @@ static int virHostdevIsPCINodeDeviceUsed(virPCIDeviceAddressPtr devAddr, void *o return ret; } +static int virHostdevIsAllIOMMUGroupUsed(virPCIDeviceAddressPtr devAddr, void *opaque) +{ + struct virHostdevIsAllIOMMUGroupUsedData *helperData = opaque; + virPCIDevicePtr actual; + + actual = virPCIDeviceListFindByIDs(helperData->pcidevs, + devAddr->domain, devAddr->bus, + devAddr->slot, devAddr->function); + if (actual) { + return 0; + } else { + virReportError(VIR_ERR_OPERATION_INVALID, + _("All devices of the same IOMMU group as the " + "multifunction PCI device %s must belong to " + "domain %s"), + helperData->deviceName, helperData->domainName); + return -1; + } +} + static int virHostdevManagerOnceInit(void) { if (!VIR_CLASS_NEW(virHostdevManager, virClassForObject())) @@ -714,9 +741,10 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, unsigned int flags) { VIR_AUTOUNREF(virPCIDeviceListPtr) pcidevs = NULL; + VIR_AUTOFREE(unsigned int *) searchedIOMMUs = NULL; int last_processed_hostdev_vf = -1; - size_t i; - int ret = -1; + size_t i, j; + int ret = -1, nSearchedIOMMUs = 0; virPCIDeviceAddressPtr devAddr = NULL; if (!nhostdevs) @@ -725,6 +753,9 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) return -1; + if (VIR_ALLOC_N(searchedIOMMUs, virPCIDeviceListCount(pcidevs))) + return -1; + virObjectLock(mgr->activePCIHostdevs); virObjectLock(mgr->inactivePCIHostdevs); @@ -767,6 +798,35 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, if (virHostdevIsPCINodeDeviceUsed(devAddr, &data)) goto cleanup; + /* Multifunction PCI devices are more restrict than regular + * VFIO devices, since it requires *all* devices from the same + * IOMMU to belong to the same domain - even if not all of + * them are assigned to the guest. + */ + if (virPCIDeviceIsMultifunction(pci)) { + struct virHostdevIsAllIOMMUGroupUsedData helper = { + pcidevs, dom_name, virPCIDeviceGetName(pci)}; + int devIOMMUGroup = virPCIDeviceAddressGetIOMMUGroupNum(devAddr); + bool alreadySearched = false; + + for (j = 0; j < nSearchedIOMMUs; j++) { + if (devIOMMUGroup == searchedIOMMUs[j]) { + alreadySearched = true; + break; + } + } + + if (alreadySearched) + continue; + + if (virPCIDeviceAddressIOMMUGroupIterate( + devAddr, virHostdevIsAllIOMMUGroupUsed, &helper) < 0) + goto cleanup; + + searchedIOMMUs[nSearchedIOMMUs++] = devIOMMUGroup; + continue; + } + /* VFIO devices belonging to same IOMMU group can't be * shared across guests. Check if that's the case. */ if (usesVFIO) { diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args new file mode 100644 index 0000000000..4074968c84 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args @@ -0,0 +1,31 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-delete \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-delete/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-delete/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-delete/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-x86_64 \ +-name delete \ +-S \ +-machine pc,accel=kvm,usb=off,dump-guest-core=off \ +-m 256 \ +-realtime mlock=off \ +-smp 4,sockets=4,cores=1,threads=1 \ +-uuid 583a8e8e-f0ce-4f53-89ab-092862148b25 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,path=/tmp/lib/domain--1-delete/monitor.sock,\ +server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-usb \ +-device vfio-pci,host=0005:90:01.0,id=hostdev0,bus=pci.0,addr=0x3 \ +-device vfio-pci,host=0005:90:01.1,id=hostdev1,bus=pci.0,addr=0x4 \ +-device vfio-pci,host=0005:90:01.3,id=hostdev2,bus=pci.0,addr=0x5 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml new file mode 100644 index 0000000000..fc3e3bb520 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml @@ -0,0 +1,35 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + </hostdev> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 434e01308c..462852c690 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1361,6 +1361,10 @@ mymain(void) QEMU_CAPS_KVM, QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("hostdev-pci-multifunction-partial-fail", + QEMU_CAPS_KVM, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("serial-file-log", QEMU_CAPS_CHARDEV_FILE_APPEND, QEMU_CAPS_DEVICE_ISA_SERIAL, -- 2.21.0

On Mon, 7 Oct 2019 18:11:32 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
(--- long post warning ---)
This is a work that derived from the discussions I had with Laine Stump and Alex Williamson in [1]. I'll provide a quick gist below.
----------
Today, Libvirt does not have proper support for partial assignment of functions of passed-through PCI multifunction devices (hostdev with VFIO-PCI). By partial assignment I mean the guest being able to use just some, not all, virtual functions of the device. Even if the functions itself became useless in the host, the some functions might not be safe to be used by the guest, thus the user should be able to limit it.
Not safe in what way? Patch 2/4 says some devices might be "security sensitive", but the fact that this patch is necessary implies that the host kernel already considers the devices non-isolated. They must be in the same iommu group to have this issue. Is there a concrete example of a device where a user would want this configuration? The case I can think of is not a security issue, but a functional one where GPU and audio functions are grouped together and maybe the audio function doesn't work well when assigned, or maybe we just want the guest to default to another audio device and it's easier if we just don't expose this on-card audio.
I mentioned 'proper' because today it is possible to get this done in Libvirt if we use 'managed=no' in the hostdevs. If the user makes the proper setup (i.e. detaching all IOMMU devices), and use managed='no', Libvirt will launch the guest just with the functions declared in the XML. The technical reason for this is simple: in virHostdevPreparePCIDevices() we do not take into account that multifunction PCI devices requires the whole IOMMU to be detached, not just the devices being declared in def->hostdevs. In this case, managed='yes' will not work in this scenario, causing errors in QEMU launch.
The discussion I've started in [1] was motivated by my attempt of automatically detaching the IOMMU inside the prepare function with managed='yes' devices. Laine discarded this idea, arguing that the concept of partial assignment will cause user confusion if Libvirt starts to handle things without the user being fully aware. In [1] it was discussed the possibility of declaring the functions that won't be assigned to the guest in the XML, forcing the user to be aware that these functions will be lost in the host, as a possible approach for a solution.
-----------
These series tries to solve the partial assignment of multifunction hostdev PCI devices by introducing a new hostdev attribute called 'assigned'. This is how it works:
- it is a boolean value that will be efffective just for multifunction hostdev PCI devices, since there's no other occurrence for this kind of use in Libvirt. Trying to declare assign='yes|no' in any other PCI hostdev device will cause parse errors;
- default value if the attribute is not present is 'assigned=yes';
- <address> element will be forbidden if the hostdev is declared with assigned='no'. This is to make more evident to the user that this is a function that the guest will NOT be using, with a bonus that we will not need to calculate an address that won't be used;
It seems more intuitive to me to use the guest <address> element to expose this. libvirt often makes use of 'none' to declare empty devices, so maybe <address type='none'/> would be more in line with precedent. Thanks, Alex

On 10/7/19 7:41 PM, Alex Williamson wrote:
On Mon, 7 Oct 2019 18:11:32 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
(--- long post warning ---)
This is a work that derived from the discussions I had with Laine Stump and Alex Williamson in [1]. I'll provide a quick gist below.
----------
Today, Libvirt does not have proper support for partial assignment of functions of passed-through PCI multifunction devices (hostdev with VFIO-PCI). By partial assignment I mean the guest being able to use just some, not all, virtual functions of the device. Even if the functions itself became useless in the host, the some functions might not be safe to be used by the guest, thus the user should be able to limit it. Not safe in what way? Patch 2/4 says some devices might be "security sensitive", but the fact that this patch is necessary implies that the host kernel already considers the devices non-isolated. They must be in the same iommu group to have this issue. Is there a concrete example of a device where a user would want this configuration? The case I can think of is not a security issue, but a functional one where GPU and audio functions are grouped together and maybe the audio function doesn't work well when assigned, or maybe we just want the guest to default to another audio device and it's easier if we just don't expose this on-card audio.
The audio card example is one was thinking of (and I believe it was brought up in the [1] thread as well) when writing about the need for this work. But in the end, I believe my use of 'security issue' wording is wrong - I am implying that there might be a case in which isolated devices, in the same IOMMU, can present security risks for each other or something like that when assigned to the guest. I can't make such strong claim. These patches are mote about enhancing a functional use, like you said. Thanks for pointing this out. I'll rearrange the discourse in the next spins.
I mentioned 'proper' because today it is possible to get this done in Libvirt if we use 'managed=no' in the hostdevs. If the user makes the proper setup (i.e. detaching all IOMMU devices), and use managed='no', Libvirt will launch the guest just with the functions declared in the XML. The technical reason for this is simple: in virHostdevPreparePCIDevices() we do not take into account that multifunction PCI devices requires the whole IOMMU to be detached, not just the devices being declared in def->hostdevs. In this case, managed='yes' will not work in this scenario, causing errors in QEMU launch.
The discussion I've started in [1] was motivated by my attempt of automatically detaching the IOMMU inside the prepare function with managed='yes' devices. Laine discarded this idea, arguing that the concept of partial assignment will cause user confusion if Libvirt starts to handle things without the user being fully aware. In [1] it was discussed the possibility of declaring the functions that won't be assigned to the guest in the XML, forcing the user to be aware that these functions will be lost in the host, as a possible approach for a solution.
-----------
These series tries to solve the partial assignment of multifunction hostdev PCI devices by introducing a new hostdev attribute called 'assigned'. This is how it works:
- it is a boolean value that will be efffective just for multifunction hostdev PCI devices, since there's no other occurrence for this kind of use in Libvirt. Trying to declare assign='yes|no' in any other PCI hostdev device will cause parse errors;
- default value if the attribute is not present is 'assigned=yes';
- <address> element will be forbidden if the hostdev is declared with assigned='no'. This is to make more evident to the user that this is a function that the guest will NOT be using, with a bonus that we will not need to calculate an address that won't be used; It seems more intuitive to me to use the guest <address> element to expose this. libvirt often makes use of 'none' to declare empty devices, so maybe <address type='none'/> would be more in line with precedent. Thanks,
If <address type='none'> is not being used by anything else (it doesn't appear to be, at least in a quick look at libvirt.org docs), this is a good idea indeed. It also spare us from adding more documentation for a new attribute. I'll wait to see if more people wants to comment in this work and, unless someone presents a good reason not to, I'll see if I can make this <address type='none'> happen. Thanks, DHB
Alex
participants (2)
-
Alex Williamson
-
Daniel Henrique Barboza