[libvirt] [PATCH v2 0/4] PCI hostdev partial assignment support

Changes from previous version [1], all of them result of feedback from Alex Williamson and Abdulla Bubshait: - use <address type='none'/> instead of creating a new subsys attribute; - expand the change to all PCI hostdevs. Former patch 01 was discarded because we don't need the PCI Multifunction helpers for now; - series changed name to reflect what it's being done - new patch 04: add documentation to formatdomain.html.in To avoid a huge wall of text please refer to [1] for context about the road up to here. Commit msgs of the first 3 patches tells the story as well. [1] https://www.redhat.com/archives/libvir-list/2019-October/msg00298.html What I want to discuss here instead is a caveat that I've found while testing this work, since its first version. This test was done in a Power 8 system with a Broadcom BCM5719 PCIe Multifunction card, with 4 virtual functions. This series enables Libvirt to declare PCI hostdevs that will not be visible to the guest using address type='none'. During the tests I faced a scenario that I expected to fail, but it didn't. This is the relevant XML except: <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x0'/> </source> <address type='none'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x1'/> </source> <address type='none'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x2'/> </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x2'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x3'/> </source> <address type='none'/> </hostdev> I'm declaring all the BCM5719 functions in the XML, but I am making functions 0, 1 and 3 unassignable by the guest using address type='none'. This test was meant to fail, but it didn't. To my surprise the guest booted and the device is functional: $ lspci 0000:00:01.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 0000:00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01) 0000:00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 0001:00:01.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) $ I've talked with Michael Roth (QEMU PPC64 developer that worked with the PCI multifunction hotplug/unplug support in the PPC64 machine) about this. He mentioned that this is intended. I'll quote here what he had to say about it: "The current logic is that we only emit the hotplug event when function 0 is attached, but if some other function is attached at boot-time the guest will still see it on the bus, and whether that works or not I think is up to the device/driver" This explains why this test didn't fail as I expected. At least for the PPC64 machine, depending on the device support, this setup is allowed. PPC64 machine uses function 0 hotplug as a signal of 'plug all the queue functions and function 0', but function 0 isn't required at boot time. I would like to hear other opinions in this because I can't say whether this is allowed in x86. I am mentioning all this now because this had a direct impact on the design of this work since the previous version, and I failed to bring it up back then. I am *not* checking for the assignment of function 0 at guest boot time in Libvirt, leaving the user free to decide what to do. I am aware that this will be inconsistent to the logic of the PCI multifunction hotplug/unplug support, where function 0 is required. This also puts a lot of faith in the user, relying that the user is fully aware of the capabilities of the hardware. My question is: should Libvirt force function 0 to be present in boot time as well, regardless of whether the PPC64 guest or some cards are able to boot without it? Thanks, DHB Daniel Henrique Barboza (4): domain_conf: allow address type='none' to unassign PCI hostdevs qemu: handle unassigned PCI hostdevs in command line and alias virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='none'/> docs/formatdomain.html.in | 15 +++++ docs/schemas/domaincommon.rng | 5 ++ src/conf/domain_conf.c | 56 ++++++++++++++-- src/conf/domain_conf.h | 3 + src/qemu/qemu_alias.c | 6 ++ src/qemu/qemu_command.c | 4 ++ src/qemu/qemu_domain_address.c | 6 ++ src/util/virhostdev.c | 64 +++++++++++++++++-- .../hostdev-pci-address-none.args | 31 +++++++++ .../hostdev-pci-address-none.xml | 42 ++++++++++++ ...ostdev-pci-multifunction-partial-fail.args | 31 +++++++++ ...hostdev-pci-multifunction-partial-fail.xml | 35 ++++++++++ tests/qemuxml2argvtest.c | 8 +++ .../hostdev-pci-address-none.xml | 58 +++++++++++++++++ tests/qemuxml2xmltest.c | 1 + 15 files changed, 352 insertions(+), 13 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-none.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-none.xml create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-address-none.xml -- 2.21.0

Today, to use a PCI hostdev "A" in a domain, all PCI devices that belongs to the same IOMMU group must also be declared in the domain XML, meaning that all IOMMU devices are detached from the host and all of them are visible to the guest. The result is that the guest will have access to all devices, but this is not necessarily what the administrator wanted. If only the hostdev "A" was intended for guest usage, but hostdevs "B" and "C" happened to be in the same IOMMU group of "A", the guest will gain access to all 3 devices. This makes the administrator rely on alternative solutions, such as use all hostdevs with un-managed mode and detached all the IOMMU before the guest starts. If use un-managed mode is not an option, the only alternative left is an ACS patch to deny guest access to "B" and "C". Discussions made in [1] and [2] led to the approach implemented here. This patch builds upon the already existing VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE format to use it as a indication of whether a hostdev will be owned by a domain, but not visible to the guest OS. This allows the administrator to declare all the IOMMU while also choosing which hostdevs will be usable by the guest. This new mechanic applies to all PCI hostdevs, regardless of whether they are a PCI multifunction hostdev or not. Using <address type='none'/> in any case other than a PCI hostdev will result in error. The use of a new 'unassigned' flag in virDomainHostdevDef makes it easier to distinguish between the case we're handling here versus the case in which a PCI hostdev has no <address> element. Both are interpreted as VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE in the existing code, thus reducing the logic to a flag makes it easier to handle this new use case. In the next patch we'll use the 'unassigned' flag to filter the hostdevs being initialized in the QEMU command line. [1] https://www.redhat.com/archives/libvir-list/2019-July/msg01175.html [2] https://www.redhat.com/archives/libvir-list/2019-October/msg00298.html Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- docs/schemas/domaincommon.rng | 5 ++ src/conf/domain_conf.c | 56 ++++++++++++++++-- src/conf/domain_conf.h | 3 + src/qemu/qemu_domain_address.c | 6 ++ .../hostdev-pci-address-none.xml | 42 ++++++++++++++ .../hostdev-pci-address-none.xml | 58 +++++++++++++++++++ tests/qemuxml2xmltest.c | 1 + 7 files changed, 165 insertions(+), 6 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-none.xml create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-address-none.xml diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index a83c9ae7a5..a9a3f2e2d1 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -5463,6 +5463,11 @@ </attribute> <ref name="dimmaddress"/> </group> + <group> + <attribute name="type"> + <value>none</value> + </attribute> + </group> </choice> </element> </define> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 54d6ae297e..1bb24d2632 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -7159,6 +7159,15 @@ virDomainDeviceInfoFormat(virBufferPtr buf, virBufferAddLit(buf, "/>\n"); } + /* Format <address type='none'/> for un-assigned hostdevs */ + if (flags & VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE) { + virBufferAsprintf(&attrBuf, " type='%s'", + virDomainDeviceAddressTypeToString(info->type)); + virXMLFormatElement(buf, "address", &attrBuf, &childBuf); + + return 0; + } + if (info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE || info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390) /* We're done here */ @@ -8129,7 +8138,7 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, virDomainHostdevDefPtr def, unsigned int flags) { - xmlNodePtr sourcenode; + xmlNodePtr sourcenode, addressnode; int backend; virDomainHostdevSubsysPCIPtr pcisrc = &def->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &def->source.subsys.u.scsi; @@ -8280,6 +8289,28 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, if (virDomainHostdevSubsysPCIDefParseXML(sourcenode, def, flags) < 0) return -1; + /* @unassigned has meaning only for hostdev PCI devices. + * <address type='none/> has a special meaning in this case, + * telling the guest that this hostdev shouldn't be assigned + * by it. + * + * Note that both <address type='none'/> and no <address> element + * declared implies in VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE + * through the code. This is why we can't simply check for + * VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE to set the @unassigned + * flag. + */ + def->unassigned = false; + + if ((addressnode = virXPathNode("./address", ctxt))) { + g_autofree char *address_type = virXMLPropString(addressnode, + "type"); + int typeFromString = virDomainDeviceAddressTypeFromString(address_type); + + if (typeFromString == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE) + def->unassigned = true; + } + backend = VIR_DOMAIN_HOSTDEV_PCI_BACKEND_DEFAULT; if ((backendStr = virXPathString("string(./driver/@name)", ctxt)) && (((backend = virDomainHostdevSubsysPCIBackendTypeFromString(backendStr)) < 0) || @@ -15579,7 +15610,17 @@ virDomainHostdevDefParseXML(virDomainXMLOptionPtr xmlopt, } if (def->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE) { - if (virDomainDeviceInfoParseXML(xmlopt, node, def->info, + /* skip address parsing if we already know it is an un-assigned + * pci hostdev */ + bool skipParse = false; + + if (def->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + def->unassigned) + skipParse = true; + + if (!skipParse && + virDomainDeviceInfoParseXML(xmlopt, node, def->info, flags | VIR_DOMAIN_DEF_PARSE_ALLOW_BOOT | VIR_DOMAIN_DEF_PARSE_ALLOW_ROM) < 0) goto error; @@ -27082,6 +27123,7 @@ virDomainHostdevDefFormat(virBufferPtr buf, virDomainHostdevSubsysMediatedDevPtr mdevsrc = &def->source.subsys.u.mdev; virDomainHostdevSubsysSCSIVHostPtr scsihostsrc = &def->source.subsys.u.scsi_host; const char *type; + unsigned int formatFlags; if (!mode) { virReportError(VIR_ERR_INTERNAL_ERROR, @@ -27165,11 +27207,13 @@ virDomainHostdevDefFormat(virBufferPtr buf, if (def->shareable) virBufferAddLit(buf, "<shareable/>\n"); - if (virDomainDeviceInfoFormat(buf, def->info, - flags | VIR_DOMAIN_DEF_FORMAT_ALLOW_BOOT - | VIR_DOMAIN_DEF_FORMAT_ALLOW_ROM) < 0) { + formatFlags = flags | VIR_DOMAIN_DEF_FORMAT_ALLOW_BOOT + | VIR_DOMAIN_DEF_FORMAT_ALLOW_ROM; + if (def->unassigned) + formatFlags |= VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE; + + if (virDomainDeviceInfoFormat(buf, def->info, formatFlags) < 0) return -1; - } virBufferAdjustIndent(buf, -2); virBufferAddLit(buf, "</hostdev>\n"); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 39fb42e29d..51851e3638 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -344,6 +344,7 @@ struct _virDomainHostdevDef { bool missing; bool readonly; bool shareable; + bool unassigned; union { virDomainHostdevSubsys subsys; virDomainHostdevCaps caps; @@ -3021,6 +3022,8 @@ typedef enum { VIR_DOMAIN_DEF_FORMAT_ALLOW_ROM = 1 << 6, VIR_DOMAIN_DEF_FORMAT_ALLOW_BOOT = 1 << 7, VIR_DOMAIN_DEF_FORMAT_CLOCK_ADJUST = 1 << 8, + /* format address type='none' for un-assigned PCI hostdevs */ + VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE = 1 << 9, } virDomainDefFormatFlags; /* Use these flags to skip specific domain ABI consistency checks done diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 605984f80f..301a08197b 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -2313,6 +2313,12 @@ qemuDomainAssignDevicePCISlots(virDomainDefPtr def, continue; } + /* Do not assign an address to a hostdev that will not + * be assigned to the guest. + */ + if (def->hostdevs[i]->unassigned) + continue; + if (qemuDomainPCIAddressReserveNextAddr(addrs, def->hostdevs[i]->info) < 0) goto error; diff --git a/tests/qemuxml2argvdata/hostdev-pci-address-none.xml b/tests/qemuxml2argvdata/hostdev-pci-address-none.xml new file mode 100644 index 0000000000..63c9d4eef6 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-address-none.xml @@ -0,0 +1,42 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + <address type='none'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + </hostdev> + </devices> +</domain> diff --git a/tests/qemuxml2xmloutdata/hostdev-pci-address-none.xml b/tests/qemuxml2xmloutdata/hostdev-pci-address-none.xml new file mode 100644 index 0000000000..08aa5323e4 --- /dev/null +++ b/tests/qemuxml2xmloutdata/hostdev-pci-address-none.xml @@ -0,0 +1,58 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <currentMemory unit='KiB'>262144</currentMemory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <controller type='ide' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + <address type='none'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> + </hostdev> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> + </memballoon> + </devices> +</domain> diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index ed53ddc8c6..039d5ac1ad 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -423,6 +423,7 @@ mymain(void) DO_TEST("hostdev-usb-address", NONE); DO_TEST("hostdev-pci-address", NONE); + DO_TEST("hostdev-pci-address-none", NONE); DO_TEST("hostdev-pci-multifunction", NONE); DO_TEST("hostdev-vfio", NONE); DO_TEST("hostdev-vfio-zpci", -- 2.21.0

On Thu, Nov 21, 2019 at 07:19:14PM -0300, Daniel Henrique Barboza wrote:
Today, to use a PCI hostdev "A" in a domain, all PCI devices that belongs to the same IOMMU group must also be declared in the domain XML, meaning that all IOMMU devices are detached from the host and all of them are visible to the guest.
The result is that the guest will have access to all devices, but this is not necessarily what the administrator wanted. If only the hostdev "A" was intended for guest usage, but hostdevs "B" and "C" happened to be in the same IOMMU group of "A", the guest will gain access to all 3 devices. This makes the administrator rely on alternative solutions, such as use all hostdevs with un-managed mode and detached all the IOMMU before the guest starts. If use un-managed mode is not an option, the only alternative left is an ACS patch to deny guest access to "B" and "C".
Discussions made in [1] and [2] led to the approach implemented here. This patch builds upon the already existing VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE format to use it as a indication of whether a hostdev will be owned by a domain, but not visible to the guest OS. This allows the administrator to declare all the IOMMU while also choosing which hostdevs will be usable by the guest. This new mechanic applies to all PCI hostdevs, regardless of whether they are a PCI multifunction hostdev or not. Using <address type='none'/> in any case other than a PCI hostdev will result in error.
The use of a new 'unassigned' flag in virDomainHostdevDef makes it easier to distinguish between the case we're handling here versus the case in which a PCI hostdev has no <address> element. Both are interpreted as VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE in the existing code, thus reducing the logic to a flag makes it easier to handle this new use case. In the next patch we'll use the 'unassigned' flag to filter the hostdevs being initialized in the QEMU command line.
This paragraph illustrates why I think this approach is a bad idea. It is overloading ADDRESS_NONE to have 2 separate meanings. When we've done such things in the past we've usually ended up regretting it. We're using ADDRESS_NONE as our signal that the user does not care about the address and thus libvirt should automatically assign one. That's consistent across all the different types of devices that we have. I don't want to see us add special semantics for this just for host devices. The fact that you stil lneeded a "bool unassigned" flag to be stored in the virDomainHostdevDef is a clear sign that this must be exposed as a distinct XML attribute. I see you in fact did exactly this in the previous version of this patch series, so I'd like to go back to that version. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 11/22/19 12:13 PM, Daniel P. Berrangé wrote:
On Thu, Nov 21, 2019 at 07:19:14PM -0300, Daniel Henrique Barboza wrote: [...]
The use of a new 'unassigned' flag in virDomainHostdevDef makes it easier to distinguish between the case we're handling here versus the case in which a PCI hostdev has no <address> element. Both are interpreted as VIR_DOMAIN_DEF_FORMAT_ADDRESS_NONE in the existing code, thus reducing the logic to a flag makes it easier to handle this new use case. In the next patch we'll use the 'unassigned' flag to filter the hostdevs being initialized in the QEMU command line.
This paragraph illustrates why I think this approach is a bad idea. It is overloading ADDRESS_NONE to have 2 separate meanings. When we've done such things in the past we've usually ended up regretting it.
We're using ADDRESS_NONE as our signal that the user does not care about the address and thus libvirt should automatically assign one. That's consistent across all the different types of devices that we have.
I don't want to see us add special semantics for this just for host devices.
The fact that you stil lneeded a "bool unassigned" flag to be stored in the virDomainHostdevDef is a clear sign that this must be exposed as a distinct XML attribute.
I see you in fact did exactly this in the previous version of this patch series, so I'd like to go back to that version.
Do you suggest to keep exactly like it was in v1, an extra parameter in the subsys element? We can for example create a new address type, "unassigned", to avoid overloading ADDRESS_NONE like I'm doing here. I'm suggesting this because, back in v1, I had to take a peak in the <address> element to avoid the case where the user would assign an address to an unassigned device. Since I'll need to took in the <address> element anyway, a new "unassigned" address type would make it easier to cover this cases. It's also a bit less verbose than adding "assigned='yes|no'" in the end of the <hostdev> element. I could also say that this would also be easier for other device types to use, but I can't come up with a valid use case for anything other than a PCI hostdev to be unassignable. Thanks, DHB
Regards, Daniel

Previous patch made it possible for the QEMU driver to check if a given PCI hostdev is unassigned=true, meaning that this device shouldn't be part of the actual guest launch and alias generation. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_alias.c | 6 ++++ src/qemu/qemu_command.c | 4 +++ .../hostdev-pci-address-none.args | 31 +++++++++++++++++++ tests/qemuxml2argvtest.c | 4 +++ 4 files changed, 45 insertions(+) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-none.args diff --git a/src/qemu/qemu_alias.c b/src/qemu/qemu_alias.c index 93bdcb7548..8733db4cf5 100644 --- a/src/qemu/qemu_alias.c +++ b/src/qemu/qemu_alias.c @@ -603,6 +603,12 @@ qemuAssignDeviceAliases(virDomainDefPtr def, virQEMUCapsPtr qemuCaps) return -1; } for (i = 0; i < def->nhostdevs; i++) { + /* do not assign alias for PCI multifunction hostdevs that + * will not be assigned to the guest. + */ + if (def->hostdevs[i]->unassigned) + continue; + /* we can't start assigning at 0, since netdevs may have used * up some hostdevN entries already. Also if the HostdevDef is * linked to a NetDef, they will share an info and the alias diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index ca1bd12594..d27ca39cbf 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -5418,6 +5418,10 @@ qemuBuildHostdevCommandLine(virCommandPtr cmd, } } + /* Ignore unassigned devices */ + if (hostdev->unassigned) + continue; + unsigned int bootIndex = hostdev->info->bootIndex; /* bootNet will be non-0 if boot order was set and no other diff --git a/tests/qemuxml2argvdata/hostdev-pci-address-none.args b/tests/qemuxml2argvdata/hostdev-pci-address-none.args new file mode 100644 index 0000000000..14fa3d6b37 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-address-none.args @@ -0,0 +1,31 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-delete \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-delete/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-delete/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-delete/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-x86_64 \ +-name delete \ +-S \ +-machine pc,accel=kvm,usb=off,dump-guest-core=off \ +-m 256 \ +-realtime mlock=off \ +-smp 4,sockets=4,cores=1,threads=1 \ +-uuid 583a8e8e-f0ce-4f53-89ab-092862148b25 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,path=/tmp/lib/domain--1-delete/monitor.sock,\ +server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-usb \ +-device vfio-pci,host=0005:90:01.0,id=hostdev0,bus=pci.0,addr=0x3 \ +-device vfio-pci,host=0005:90:01.2,id=hostdev1,bus=pci.0,addr=0x4 \ +-device vfio-pci,host=0005:90:01.3,id=hostdev2,bus=pci.0,addr=0x5 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 07e711840d..ea9c440f5f 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1326,6 +1326,10 @@ mymain(void) QEMU_CAPS_KVM, QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("hostdev-pci-address-none", + QEMU_CAPS_KVM, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("serial-file-log", QEMU_CAPS_CHARDEV_FILE_APPEND, QEMU_CAPS_DEVICE_ISA_SERIAL, -- 2.21.0

virHostdevPreparePCIDevices verifies if a PCI hostdev "A" is in use by another domain by checking the 'activePCIHostdevs' list. It also verifies if other domains are using any other hostdev that belongs to the same IOMMU of "A". This is not enough to cover all the cases. Suppose a PCI hostdev "B" that shares the IOMMU with "A". We are currently able to abort guest launch if "B" is being used by another domain, but if "B" is not being used by any domain, we'll proceed guest execution. What happens then is: if "A" is being used in managed mode, the guest will fail to launch because Libvirt didn't detach "B" from the host. If "A" is being used in un-managed mode, the guest might fail or succeed to launch depending on whether both "A" and "B" were detached manually prior to guest start. Libvirt is now able to deal with these scenarios in a homogeneous fashion, providing the same behavior for both managed and un-managed mode while allowing parcial assignment for both cases as well. This patch changes virHostdevPreparePCIDevices to check all the PCI hostdevs of the domain against all the PCI hostdevs of the IOMMU, failing to launch if the domain does not contain all the PCI hostdevs declared, in both managed and un-managed cases. After this patch, any existing domains that were using PCI hostdevs with un-managed mode, and were getting away with partial assignment without declaring all the IOMMU PCI hostdevs in the domain XML, will be forced to do so. The missing PCI hostdevs can be declared with "<address type='none'/>" to keep them invisible to the guest, making the guest oblivious of this change. This is an annoyance for such domains, but the payoff is more consistency of behavior between managed and un-managed mode and more clarity on the domain XML, forcing the admin to declare all PCI hostdevs of the IOMMU and knowing exactly what will be detached from the host. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/util/virhostdev.c | 64 +++++++++++++++++-- ...ostdev-pci-multifunction-partial-fail.args | 31 +++++++++ ...hostdev-pci-multifunction-partial-fail.xml | 35 ++++++++++ tests/qemuxml2argvtest.c | 4 ++ 4 files changed, 127 insertions(+), 7 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 17eb290825..eff3ecb839 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -56,6 +56,13 @@ struct virHostdevIsPCINodeDeviceUsedData { bool usesVFIO; }; +struct virHostdevIsAllIOMMUGroupUsedData { + virPCIDeviceListPtr pcidevs; + const char *domainName; + const char *deviceName; + int iommuGroup; +}; + /* This module makes heavy use of bookkeeping lists contained inside a * virHostdevManager instance to keep track of the devices' status. To make * it easy to spot potential ownership errors when moving devices from one @@ -111,6 +118,27 @@ static int virHostdevIsPCINodeDeviceUsed(virPCIDeviceAddressPtr devAddr, void *o return 0; } +static int virHostdevIsAllIOMMUGroupUsed(virPCIDeviceAddressPtr devAddr, void *opaque) +{ + struct virHostdevIsAllIOMMUGroupUsedData *helperData = opaque; + virPCIDevicePtr actual; + + actual = virPCIDeviceListFindByIDs(helperData->pcidevs, + devAddr->domain, devAddr->bus, + devAddr->slot, devAddr->function); + if (actual) { + return 0; + } else { + virReportError(VIR_ERR_OPERATION_INVALID, + _("All devices of the same IOMMU group %d of " + "the PCI device %s must belong to domain %s"), + helperData->iommuGroup, + helperData->deviceName, + helperData->domainName); + return -1; + } +} + static int virHostdevManagerOnceInit(void) { if (!VIR_CLASS_NEW(virHostdevManager, virClassForObject())) @@ -724,9 +752,10 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, unsigned int flags) { g_autoptr(virPCIDeviceList) pcidevs = NULL; + g_autofree unsigned int *searchedIOMMUs = NULL; int last_processed_hostdev_vf = -1; - size_t i; - int ret = -1; + size_t i, j; + int ret = -1, nSearchedIOMMUs = 0; virPCIDeviceAddressPtr devAddr = NULL; if (!nhostdevs) @@ -735,6 +764,8 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) return -1; + searchedIOMMUs = g_new0(unsigned int, virPCIDeviceListCount(pcidevs)); + virObjectLock(mgr->activePCIHostdevs); virObjectLock(mgr->inactivePCIHostdevs); @@ -778,13 +809,32 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, goto cleanup; /* VFIO devices belonging to same IOMMU group can't be - * shared across guests. Check if that's the case. */ + * shared across guests, and all of the must belong to the + * same domain. If a device isn't going to be assigned + * (flag unassigned is true) the guest will not use it, but + * the actual device must be detached together from the host + * anyway. */ if (usesVFIO) { - data.usesVFIO = true; - if (virPCIDeviceAddressIOMMUGroupIterate(devAddr, - virHostdevIsPCINodeDeviceUsed, - &data) < 0) + int devIOMMUGroup = virPCIDeviceAddressGetIOMMUGroupNum(devAddr); + struct virHostdevIsAllIOMMUGroupUsedData helper = { + pcidevs, dom_name, virPCIDeviceGetName(pci), devIOMMUGroup}; + bool alreadySearched = false; + + for (j = 0; j < nSearchedIOMMUs; j++) { + if (devIOMMUGroup == searchedIOMMUs[j]) { + alreadySearched = true; + break; + } + } + + if (alreadySearched) + continue; + + if (virPCIDeviceAddressIOMMUGroupIterate( + devAddr, virHostdevIsAllIOMMUGroupUsed, &helper) < 0) goto cleanup; + + searchedIOMMUs[nSearchedIOMMUs++] = devIOMMUGroup; } } diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args new file mode 100644 index 0000000000..4074968c84 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.args @@ -0,0 +1,31 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-delete \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-delete/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-delete/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-delete/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-x86_64 \ +-name delete \ +-S \ +-machine pc,accel=kvm,usb=off,dump-guest-core=off \ +-m 256 \ +-realtime mlock=off \ +-smp 4,sockets=4,cores=1,threads=1 \ +-uuid 583a8e8e-f0ce-4f53-89ab-092862148b25 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,path=/tmp/lib/domain--1-delete/monitor.sock,\ +server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-usb \ +-device vfio-pci,host=0005:90:01.0,id=hostdev0,bus=pci.0,addr=0x3 \ +-device vfio-pci,host=0005:90:01.1,id=hostdev1,bus=pci.0,addr=0x4 \ +-device vfio-pci,host=0005:90:01.3,id=hostdev2,bus=pci.0,addr=0x5 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml new file mode 100644 index 0000000000..fc3e3bb520 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction-partial-fail.xml @@ -0,0 +1,35 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + </hostdev> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index ea9c440f5f..46a0ac15f0 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1330,6 +1330,10 @@ mymain(void) QEMU_CAPS_KVM, QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("hostdev-pci-multifunction-partial-fail", + QEMU_CAPS_KVM, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("serial-file-log", QEMU_CAPS_CHARDEV_FILE_APPEND, QEMU_CAPS_DEVICE_ISA_SERIAL, -- 2.21.0

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- docs/formatdomain.html.in | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index a1042a314c..65012aac88 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -4137,6 +4137,21 @@ attributes: <code>iobase</code> and <code>irq</code>. <span class="since">Since 1.2.1</span> </dd> + <dt><code>none</code></dt> + <dd>For PCI hostdevs, <code><address type='none'/></code> has + a special meaning. A PCI hostdev must be assigned to the domain + with all the devices of its IOMMU group. If the admin so chooses, + it is possible to filter which devices the guest OS will actually + see, while the Libvirt domain is retaining ownership of all the + devices of the IOMMU group. Passing address type <code>none</code> + in such cases will tell Libvirt that the device must not be + assigned to the guest. This avoids scenarios in which the admin + might be compelled to use a ACS patch to remove the device + from the guest. + <code><address type='none'/></code> is an invalid address + type for all other device types. + <span class="since">Since 5.10.0</span> + </dd> </dl> <h4><a id="elementsVirtio">Virtio-related options</a></h4> -- 2.21.0

On Thu, 21 Nov 2019 19:19:13 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
Changes from previous version [1], all of them result of feedback from Alex Williamson and Abdulla Bubshait: - use <address type='none'/> instead of creating a new subsys attribute; - expand the change to all PCI hostdevs. Former patch 01 was discarded because we don't need the PCI Multifunction helpers for now; - series changed name to reflect what it's being done - new patch 04: add documentation to formatdomain.html.in
To avoid a huge wall of text please refer to [1] for context about the road up to here. Commit msgs of the first 3 patches tells the story as well.
[1] https://www.redhat.com/archives/libvir-list/2019-October/msg00298.html
What I want to discuss here instead is a caveat that I've found while testing this work, since its first version. This test was done in a Power 8 system with a Broadcom BCM5719 PCIe Multifunction card, with 4 virtual functions. This series enables Libvirt to declare PCI hostdevs that will not be visible to the guest using address type='none'. During the tests I faced a scenario that I expected to fail, but it didn't. This is the relevant XML except:
<hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x0'/> </source> <address type='none'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x1'/> </source> <address type='none'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x2'/> </source> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x2'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0001' bus='0x09' slot='0x00' function='0x3'/> </source> <address type='none'/> </hostdev>
I'm declaring all the BCM5719 functions in the XML, but I am making functions 0, 1 and 3 unassignable by the guest using address type='none'. This test was meant to fail, but it didn't. To my surprise the guest booted and the device is functional:
$ lspci 0000:00:01.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 0000:00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01) 0000:00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 0001:00:01.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) $
I've talked with Michael Roth (QEMU PPC64 developer that worked with the PCI multifunction hotplug/unplug support in the PPC64 machine) about this. He mentioned that this is intended. I'll quote here what he had to say about it:
"The current logic is that we only emit the hotplug event when function 0 is attached, but if some other function is attached at boot-time the guest will still see it on the bus, and whether that works or not I think is up to the device/driver"
This explains why this test didn't fail as I expected. At least for the PPC64 machine, depending on the device support, this setup is allowed. PPC64 machine uses function 0 hotplug as a signal of 'plug all the queue functions and function 0', but function 0 isn't required at boot time. I would like to hear other opinions in this because I can't say whether this is allowed in x86.
I would expect that on x86 a Linux guest would need to be booted with the pci=pcie_scan_all kernel option to find this device. The PCI-core will only scan for devfn 00.0 behind a downstream port and then only scan non-zero functions if that device indicate multifunction support. I'd expect more archs to behave this way than what you see on PPC where it "just works".
I am mentioning all this now because this had a direct impact on the design of this work since the previous version, and I failed to bring it up back then. I am *not* checking for the assignment of function 0 at guest boot time in Libvirt, leaving the user free to decide what to do. I am aware that this will be inconsistent to the logic of the PCI multifunction hotplug/unplug support, where function 0 is required. This also puts a lot of faith in the user, relying that the user is fully aware of the capabilities of the hardware.
My question is: should Libvirt force function 0 to be present in boot time as well, regardless of whether the PPC64 guest or some cards are able to boot without it?
In my reading of the PCI 3.0 spec, 3.2.2.3.4 indicates that multifunction devices are required to implement function 0 and that configuration software is under no obligation to scan for higher number functions if function 0 is not present and does not have the multifunction bit set in the header type register. With PCIe, where link information, payload size, ARI, VC, etc are exposed in config space, many of these are only valid or have specific means only for function 0. What you're seeing on PPC is, I think, more typical of paravirtualized PCI enumeration, where you're only seeing a view of the bus as allowed by a hypervisor. I can't say whether the pcie_scan_all boot option was added to allow discovery of devices as in your configuration or simply to correct cases where function 0 forgot to implement the multifunction bit. Whether libvirt wants to prevent this is more of a support question, it seems spec compliant hardware should never do this, but not all hardware is spec compliant. Libvirt should never generate such a configuration on it's own, but I wouldn't necessarily object if it allows a user to shoot themselves in the foot. Thanks, Alex

On 11/21/19 8:08 PM, Alex Williamson wrote:
On Thu, 21 Nov 2019 19:19:13 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
[...]
My question is: should Libvirt force function 0 to be present in boot time as well, regardless of whether the PPC64 guest or some cards are able to boot without it?
In my reading of the PCI 3.0 spec, 3.2.2.3.4 indicates that multifunction devices are required to implement function 0 and that configuration software is under no obligation to scan for higher number functions if function 0 is not present and does not have the multifunction bit set in the header type register. With PCIe, where link information, payload size, ARI, VC, etc are exposed in config space, many of these are only valid or have specific means only for function 0.
What you're seeing on PPC is, I think, more typical of paravirtualized PCI enumeration, where you're only seeing a view of the bus as allowed by a hypervisor. I can't say whether the pcie_scan_all boot option was added to allow discovery of devices as in your configuration or simply to correct cases where function 0 forgot to implement the multifunction
Just checked. Neither the Power 8 host nor the guest is booting with the pcie_scan_all option.
bit.Whether libvirt wants to prevent this is more of a support question, it seems spec compliant hardware should never do this, but not all hardware is spec compliant. Libvirt should never generate such a configuration on it's own, but I wouldn't necessarily object if it allows a user to shoot themselves in the foot. Thanks,
I am not so thrilled about it after what you said. It seems that the PPC64 guest is behaving differently from all other archs in this case, and I'd rather make PPC64 equal to everyone else rather than allowing the PPC64 user to expect that non-compliant behavior is allowed. Thanks, DHB
Alex
participants (3)
-
Alex Williamson
-
Daniel Henrique Barboza
-
Daniel P. Berrangé