[libvirt] [PATCH v3 0/6] PCI hostdev partial assignment support

changes from previous version [1]: - do not overload address type='none'. A new address type called 'unassigned' is introduced; - there is no unassigned' flag being created in virDomainHostdevDef. The logic added by new address type is enough; - do not allow function zero of multifunction devices to be unassigned. Nothing too special to discuss in this cover letter. More details can be found at the discussions of the previous version [1]. Commit messages of the patches have more background info as well. [1] https://www.redhat.com/archives/libvir-list/2019-November/msg01099.html Daniel Henrique Barboza (6): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line and alias virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/> utils: PCI multifunction detection helpers domain_conf.c: don't allow function zero to be unassigned docs/formatdomain.html.in | 14 +++ docs/schemas/domaincommon.rng | 5 ++ src/conf/device_conf.c | 2 + src/conf/device_conf.h | 1 + src/conf/domain_conf.c | 20 ++++- src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 6 ++ src/qemu/qemu_command.c | 5 ++ src/qemu/qemu_domain.c | 1 + src/qemu/qemu_domain_address.c | 5 ++ src/util/virhostdev.c | 88 +++++++++++++++++-- src/util/virhostdev.h | 3 + src/util/virpci.c | 17 ++++ src/util/virpci.h | 2 + .../hostdev-pci-address-unassigned.args | 31 +++++++ .../hostdev-pci-address-unassigned.xml | 42 +++++++++ ...pci-multifunction-zero-unassigned-fail.xml | 42 +++++++++ tests/qemuxml2argvtest.c | 8 ++ .../hostdev-pci-address-unassigned.xml | 58 ++++++++++++ tests/qemuxml2xmltest.c | 1 + tests/virpcimock.c | 9 +- 21 files changed, 351 insertions(+), 11 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-zero-unassigned-fail.xml create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml -- 2.23.0

Today, to use a PCI hostdev "A" in a domain, all PCI devices that belongs to the same IOMMU group must also be declared in the domain XML, meaning that all IOMMU devices are detached from the host and all of them are visible to the guest. The result is that the guest will have access to all devices, but this is not necessarily what the administrator wanted. If only the hostdev "A" was intended for guest usage, but hostdevs "B" and "C" happened to be in the same IOMMU group of "A", the guest will gain access to all 3 devices. This makes the administrator rely on alternative solutions, such as use all hostdevs with un-managed mode and detached all the IOMMU before the guest starts. If use un-managed mode is not an option, the only alternative left is an ACS patch to deny guest access to "B" and "C". This patch introduces a new address type called "unassigned" to handle this situation where a hostdev will be owned by a domain, but not visible to the guest OS. This allows the administrator to declare all the IOMMU while also choosing which hostdevs will be usable by the guest. This new mechanic applies to all PCI hostdevs, regardless of whether they are a PCI multifunction hostdev or not. Using <address type='unassigned'/> in any case other than a PCI hostdev will result in error. Next patch will use this new address type in the QEMU driver to avoid adding unassigned devices to the QEMU launch command line. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- docs/schemas/domaincommon.rng | 5 ++ src/conf/device_conf.c | 2 + src/conf/device_conf.h | 1 + src/conf/domain_conf.c | 7 ++- src/qemu/qemu_command.c | 1 + src/qemu/qemu_domain.c | 1 + src/qemu/qemu_domain_address.c | 5 ++ .../hostdev-pci-address-unassigned.xml | 42 ++++++++++++++ .../hostdev-pci-address-unassigned.xml | 58 +++++++++++++++++++ tests/qemuxml2xmltest.c | 1 + 10 files changed, 122 insertions(+), 1 deletion(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index a83c9ae7a5..fa0e58f136 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -5463,6 +5463,11 @@ </attribute> <ref name="dimmaddress"/> </group> + <group> + <attribute name="type"> + <value>unassigned</value> + </attribute> + </group> </choice> </element> </define> diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index 4c57f0995f..4dbd5c1ac9 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -45,6 +45,7 @@ VIR_ENUM_IMPL(virDomainDeviceAddress, "virtio-mmio", "isa", "dimm", + "unassigned", ); static int @@ -120,6 +121,7 @@ virDomainDeviceInfoAddressIsEqual(const virDomainDeviceInfo *a, /* address types below don't have any specific data */ case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: break; case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI: diff --git a/src/conf/device_conf.h b/src/conf/device_conf.h index d98fae775c..e091d7cfe2 100644 --- a/src/conf/device_conf.h +++ b/src/conf/device_conf.h @@ -45,6 +45,7 @@ typedef enum { VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO, VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA, VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM, + VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED, VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST } virDomainDeviceAddressType; diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 9580884747..52b6c40c44 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -6286,9 +6286,11 @@ virDomainHostdevDefValidate(const virDomainHostdevDef *hostdev) switch ((virDomainHostdevSubsysType) hostdev->source.subsys.type) { case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI: if (hostdev->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE && + hostdev->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED && hostdev->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("PCI host devices must use 'pci' address type")); + _("PCI host devices must use 'pci' or " + "'unassigned' address type")); return -1; } break; @@ -7307,6 +7309,7 @@ virDomainDeviceInfoFormat(virBufferPtr buf, case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST: break; } @@ -7507,6 +7510,7 @@ virDomainDeviceAddressParseXML(xmlNodePtr address, break; case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST: break; } @@ -21858,6 +21862,7 @@ virDomainDeviceInfoCheckABIStability(virDomainDeviceInfoPtr src, case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCW: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST: break; } diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 3465d28b84..4935502e7b 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -461,6 +461,7 @@ qemuBuildVirtioDevStr(virBufferPtr buf, return -1; case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST: default: virReportEnumRangeError(virDomainDeviceAddressType, info->type); diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index c233a4ba96..549d04742b 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -7262,6 +7262,7 @@ qemuDomainDeviceDefValidateAddress(const virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA: case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: /* No validation for these address types yet */ break; diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 605984f80f..b077240899 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -2313,6 +2313,11 @@ qemuDomainAssignDevicePCISlots(virDomainDefPtr def, continue; } + /* do not reserve address for info->type='unassigned' */ + if (def->hostdevs[i]->info->type == + VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) + continue; + if (qemuDomainPCIAddressReserveNextAddr(addrs, def->hostdevs[i]->info) < 0) goto error; diff --git a/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml new file mode 100644 index 0000000000..9a2685ca0e --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml @@ -0,0 +1,42 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + <address type='unassigned'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + </hostdev> + </devices> +</domain> diff --git a/tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml b/tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml new file mode 100644 index 0000000000..2341e8432b --- /dev/null +++ b/tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml @@ -0,0 +1,58 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <currentMemory unit='KiB'>262144</currentMemory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <controller type='ide' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + <address type='unassigned'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> + </hostdev> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> + </memballoon> + </devices> +</domain> diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c index 8b43f35f06..7c53d870d5 100644 --- a/tests/qemuxml2xmltest.c +++ b/tests/qemuxml2xmltest.c @@ -431,6 +431,7 @@ mymain(void) DO_TEST("hostdev-usb-address", NONE); DO_TEST("hostdev-pci-address", NONE); + DO_TEST("hostdev-pci-address-unassigned", NONE); DO_TEST("hostdev-pci-multifunction", NONE); DO_TEST("hostdev-vfio", NONE); DO_TEST("hostdev-vfio-zpci", -- 2.23.0

Previous patch made it possible for the QEMU driver to check if a given PCI hostdev is unassigned, by checking if dev->info->type is VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED, meaning that this device shouldn't be part of the actual guest launch and alias generation. Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/qemu/qemu_alias.c | 6 ++++ src/qemu/qemu_command.c | 4 +++ .../hostdev-pci-address-unassigned.args | 31 +++++++++++++++++++ tests/qemuxml2argvtest.c | 4 +++ 4 files changed, 45 insertions(+) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args diff --git a/src/qemu/qemu_alias.c b/src/qemu/qemu_alias.c index 93bdcb7548..6139f4c9fb 100644 --- a/src/qemu/qemu_alias.c +++ b/src/qemu/qemu_alias.c @@ -603,6 +603,12 @@ qemuAssignDeviceAliases(virDomainDefPtr def, virQEMUCapsPtr qemuCaps) return -1; } for (i = 0; i < def->nhostdevs; i++) { + /* Do not assign alias for PCI hostdevs that will not be + * assigned to the guest. */ + if (def->hostdevs[i]->info->type == + VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) + continue; + /* we can't start assigning at 0, since netdevs may have used * up some hostdevN entries already. Also if the HostdevDef is * linked to a NetDef, they will share an info and the alias diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 4935502e7b..124bdc2323 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -5419,6 +5419,10 @@ qemuBuildHostdevCommandLine(virCommandPtr cmd, } } + /* Ignore unassigned devices */ + if (hostdev->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) + continue; + unsigned int bootIndex = hostdev->info->bootIndex; /* bootNet will be non-0 if boot order was set and no other diff --git a/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args new file mode 100644 index 0000000000..14fa3d6b37 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args @@ -0,0 +1,31 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/tmp/lib/domain--1-delete \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/tmp/lib/domain--1-delete/.local/share \ +XDG_CACHE_HOME=/tmp/lib/domain--1-delete/.cache \ +XDG_CONFIG_HOME=/tmp/lib/domain--1-delete/.config \ +QEMU_AUDIO_DRV=none \ +/usr/bin/qemu-system-x86_64 \ +-name delete \ +-S \ +-machine pc,accel=kvm,usb=off,dump-guest-core=off \ +-m 256 \ +-realtime mlock=off \ +-smp 4,sockets=4,cores=1,threads=1 \ +-uuid 583a8e8e-f0ce-4f53-89ab-092862148b25 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,path=/tmp/lib/domain--1-delete/monitor.sock,\ +server,nowait \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-no-acpi \ +-usb \ +-device vfio-pci,host=0005:90:01.0,id=hostdev0,bus=pci.0,addr=0x3 \ +-device vfio-pci,host=0005:90:01.2,id=hostdev1,bus=pci.0,addr=0x4 \ +-device vfio-pci,host=0005:90:01.3,id=hostdev2,bus=pci.0,addr=0x5 \ +-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index a2791d0460..5ee9c9b2dc 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1326,6 +1326,10 @@ mymain(void) QEMU_CAPS_KVM, QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("hostdev-pci-address-unassigned", + QEMU_CAPS_KVM, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("serial-file-log", QEMU_CAPS_CHARDEV_FILE_APPEND, QEMU_CAPS_DEVICE_ISA_SERIAL, -- 2.23.0

virHostdevPreparePCIDevices verifies if a PCI hostdev "A" is in use by another domain by checking the 'activePCIHostdevs' list. It also verifies if other domains are using any other hostdev that belongs to the same IOMMU of "A". This is not enough to cover all the cases. Suppose a PCI hostdev "B" that shares the IOMMU with "A". We are currently able to abort guest launch if "B" is being used by another domain, but if "B" is not being used by any domain, we'll proceed guest execution. What happens then is: if "A" is being used in managed mode, the guest will fail to launch because Libvirt didn't detach "B" from the host. If "A" is being used in un-managed mode, the guest might fail or succeed to launch depending on whether both "A" and "B" were detached manually prior to guest start. Libvirt is now able to deal with these scenarios in a homogeneous fashion, providing the same behavior for both managed and un-managed mode while allowing parcial assignment for both cases as well. This patch changes virHostdevPreparePCIDevices to check all the PCI hostdevs of the domain against all the PCI hostdevs of the IOMMU, failing to launch if the domain does not contain all the PCI hostdevs declared, in both managed and un-managed cases. After this patch, any existing domains that were using PCI hostdevs with un-managed mode, and were getting away with partial assignment without declaring all the IOMMU PCI hostdevs in the domain XML, will be forced to do so. The missing PCI hostdevs can be declared with "<address type='unassigned'/>" to keep them invisible to the guest, making the guest oblivious of this change. This is an annoyance for such domains, but the payoff is more consistency of behavior between managed and un-managed mode and more clarity on the domain XML, forcing the admin to declare all PCI hostdevs of the IOMMU and knowing exactly what will be detached from the host. This is an example of this newly added error condition when the domain does not declare all the PCI hostdevs of the same IOMMU: $ sudo ./run tools/virsh start multipci-coldplug-partial-fail error: Failed to start domain multipci-coldplug-partial-fail error: Requested operation is not valid: All devices of the same IOMMU group 1 of the PCI device 0001:09:00.0 must belong to domain multipci-coldplug-partial-fail $ Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/util/virhostdev.c | 64 ++++++++++++++++++++++++++++++++++++++----- 1 file changed, 57 insertions(+), 7 deletions(-) diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 39e6b8f49f..2e7885112f 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -56,6 +56,13 @@ struct virHostdevIsPCINodeDeviceUsedData { bool usesVFIO; }; +struct virHostdevIsAllIOMMUGroupUsedData { + virPCIDeviceListPtr pcidevs; + const char *domainName; + const char *deviceName; + int iommuGroup; +}; + /* This module makes heavy use of bookkeeping lists contained inside a * virHostdevManager instance to keep track of the devices' status. To make * it easy to spot potential ownership errors when moving devices from one @@ -111,6 +118,27 @@ static int virHostdevIsPCINodeDeviceUsed(virPCIDeviceAddressPtr devAddr, void *o return 0; } +static int virHostdevIsAllIOMMUGroupUsed(virPCIDeviceAddressPtr devAddr, void *opaque) +{ + struct virHostdevIsAllIOMMUGroupUsedData *helperData = opaque; + virPCIDevicePtr actual; + + actual = virPCIDeviceListFindByIDs(helperData->pcidevs, + devAddr->domain, devAddr->bus, + devAddr->slot, devAddr->function); + if (actual) { + return 0; + } else { + virReportError(VIR_ERR_OPERATION_INVALID, + _("All devices of the same IOMMU group %d of " + "the PCI device %s must belong to domain %s"), + helperData->iommuGroup, + helperData->deviceName, + helperData->domainName); + return -1; + } +} + static int virHostdevManagerOnceInit(void) { if (!VIR_CLASS_NEW(virHostdevManager, virClassForObject())) @@ -724,9 +752,10 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, unsigned int flags) { g_autoptr(virPCIDeviceList) pcidevs = NULL; + g_autofree unsigned int *searchedIOMMUs = NULL; int last_processed_hostdev_vf = -1; - size_t i; - int ret = -1; + size_t i, j; + int ret = -1, nSearchedIOMMUs = 0; virPCIDeviceAddressPtr devAddr = NULL; if (!nhostdevs) @@ -735,6 +764,8 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs))) return -1; + searchedIOMMUs = g_new0(unsigned int, virPCIDeviceListCount(pcidevs)); + virObjectLock(mgr->activePCIHostdevs); virObjectLock(mgr->inactivePCIHostdevs); @@ -778,13 +809,32 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr, goto cleanup; /* VFIO devices belonging to same IOMMU group can't be - * shared across guests. Check if that's the case. */ + * shared across guests, and all of the must belong to the + * same domain. If a device isn't going to be assigned + * (flag unassigned is true) the guest will not use it, but + * the actual device must be detached together from the host + * anyway. */ if (usesVFIO) { - data.usesVFIO = true; - if (virPCIDeviceAddressIOMMUGroupIterate(devAddr, - virHostdevIsPCINodeDeviceUsed, - &data) < 0) + int devIOMMUGroup = virPCIDeviceAddressGetIOMMUGroupNum(devAddr); + struct virHostdevIsAllIOMMUGroupUsedData helper = { + pcidevs, dom_name, virPCIDeviceGetName(pci), devIOMMUGroup}; + bool alreadySearched = false; + + for (j = 0; j < nSearchedIOMMUs; j++) { + if (devIOMMUGroup == searchedIOMMUs[j]) { + alreadySearched = true; + break; + } + } + + if (alreadySearched) + continue; + + if (virPCIDeviceAddressIOMMUGroupIterate( + devAddr, virHostdevIsAllIOMMUGroupUsed, &helper) < 0) goto cleanup; + + searchedIOMMUs[nSearchedIOMMUs++] = devIOMMUGroup; } } -- 2.23.0

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- docs/formatdomain.html.in | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in index 6df4a8b26e..7625265ba8 100644 --- a/docs/formatdomain.html.in +++ b/docs/formatdomain.html.in @@ -4148,6 +4148,20 @@ attributes: <code>iobase</code> and <code>irq</code>. <span class="since">Since 1.2.1</span> </dd> + <dt><code>unassigned</code></dt> + <dd>For PCI hostdevs, <code><address type='unassigned'/></code> + has a special meaning. A PCI hostdev must be assigned to the domain + with all the devices of its IOMMU group. If the admin so chooses, + it is possible to filter which devices the guest OS will actually + see, while the Libvirt domain is retaining ownership of all the + devices of the IOMMU group. Using this address type in such cases + will tell Libvirt that the device must not be assigned to the guest. + This avoids scenarios in which the admin might be compelled to use + a ACS patch to remove the device from the guest. + <code><address type='unassigned'/></code> is an invalid address + type for all other device types. + <span class="since">Since 5.11.0</span> + </dd> </dl> <h4><a id="elementsVirtio">Virtio-related options</a></h4> -- 2.23.0

This patch introduces two helpers that will be used in the next patch. One is virpci.c:virPCIDeviceIsMultifunction(), and the other is virhostdev.c:virHostdevIsPCIMultifunctionDevice(). Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/libvirt_private.syms | 2 ++ src/util/virhostdev.c | 24 ++++++++++++++++++++++++ src/util/virhostdev.h | 3 +++ src/util/virpci.c | 17 +++++++++++++++++ src/util/virpci.h | 2 ++ 5 files changed, 48 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 8fe0bf9365..345c9747e8 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2137,6 +2137,7 @@ virHostCPUStatsAssign; # util/virhostdev.h virHostdevFindUSBDevice; virHostdevIsMdevDevice; +virHostdevIsPCIMultifunctionDevice; virHostdevIsSCSIDevice; virHostdevIsVFIODevice; virHostdevManagerGetDefault; @@ -2717,6 +2718,7 @@ virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; virPCIDeviceHasPCIExpressLink; virPCIDeviceIsAssignable; +virPCIDeviceIsMultifunction; virPCIDeviceIsPCIExpress; virPCIDeviceListAdd; virPCIDeviceListAddCopy; diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index 2e7885112f..803daaa995 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -2251,3 +2251,27 @@ virHostdevUpdateActiveDomainDevices(virHostdevManagerPtr mgr, return 0; } + +bool +virHostdevIsPCIMultifunctionDevice(virDomainHostdevDefPtr hostdev) +{ + g_autoptr(virPCIDevice) pciDev = NULL; + virDomainHostdevSubsysPCIPtr pcisrc = &hostdev->source.subsys.u.pci; + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || + hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) + return false; + + /* Libvirt should be able to perform all the operations in + * virPCIDeviceNew() even if it's running unprivileged, so if this + * fails, the device apparently doesn't currently exist on the host. + * Since we can't speculate, assume this device is not multifunction. + */ + pciDev = virPCIDeviceNew(pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); + + if (!pciDev) + return false; + + return virPCIDeviceIsMultifunction(pciDev); +} diff --git a/src/util/virhostdev.h b/src/util/virhostdev.h index 2d61c21e9d..c5a1fdcf5e 100644 --- a/src/util/virhostdev.h +++ b/src/util/virhostdev.h @@ -194,6 +194,9 @@ bool virHostdevIsMdevDevice(const virDomainHostdevDef *hostdev) ATTRIBUTE_NONNULL(1); bool +virHostdevIsPCIMultifunctionDevice(virDomainHostdevDefPtr hostdev) + ATTRIBUTE_NONNULL(1); +bool virHostdevIsVFIODevice(const virDomainHostdevDef *hostdev) ATTRIBUTE_NONNULL(1); diff --git a/src/util/virpci.c b/src/util/virpci.c index 9bea5a20d0..840cca77fc 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2818,6 +2818,23 @@ int virPCIGetHeaderType(virPCIDevicePtr dev, int *hdrType) } +bool +virPCIDeviceIsMultifunction(virPCIDevicePtr dev) +{ + int fd; + uint8_t type; + + if ((fd = virPCIDeviceConfigOpen(dev)) < 0) + return -1; + + type = virPCIDeviceRead8(dev, fd, PCI_HEADER_TYPE); + + virPCIDeviceConfigClose(dev, fd); + + return type & PCI_HEADER_TYPE_MULTI; +} + + void virPCIEDeviceInfoFree(virPCIEDeviceInfoPtr dev) { diff --git a/src/util/virpci.h b/src/util/virpci.h index cfb4581edf..8f37636809 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -266,6 +266,8 @@ int virPCIDeviceGetLinkCapSta(virPCIDevicePtr dev, int virPCIGetHeaderType(virPCIDevicePtr dev, int *hdrType); +bool virPCIDeviceIsMultifunction(virPCIDevicePtr dev); + void virPCIEDeviceInfoFree(virPCIEDeviceInfoPtr dev); ssize_t virPCIGetMdevTypes(const char *sysfspath, -- 2.23.0

The recently added 'unassigned' address type for PCI hostdevs revealed an uncommon behavior in a Power 8 host with a Broadcom BCM5719 PCIe Multifunction card. With this setup, it is possible to passthrough the BCM5719 functions to the domain, but unassigning the function zero. This means that the guest is able to use non-zero functions of the device without having access to the function zero. This is the output of 'lspci' in this scenario where only function 2 is being assigned: $ lspci 0000:00:01.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 0000:00:03.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01) 0000:00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 0001:00:01.2 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01) $ More details can be read at [1]. In short, no other architecture is allowing this kind of behavior as PowerPC is displaying here. For x86, an extra 'pci=pcie_scan_all' kernel option would be needed for the guest to find this device, while in PowerPC this wasn't needed. There's the question of compliance with the PCI 3.0 spec that Alex Williamson brought up in [2] that would be hurt if we allow this type of use. Allowing the zero function to be unassigned will also conflict with the logic use for hotplug/unplug of multifunction devices, which is a work in progress in Libvirt, since in both hotplug and unplug we rely on function zero to trigger the operation in QEMU. With all these downsides, summed up with the fact that the only case we have proof where this might be a thing is a Power 8 host comboed with the BCM5719 card, let's forbid this kind of use in Libvirt altogether. [1] https://www.redhat.com/archives/libvir-list/2019-November/msg01099.html [2] https://www.redhat.com/archives/libvir-list/2019-November/msg01104.html Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- src/conf/domain_conf.c | 13 ++++++ ...pci-multifunction-zero-unassigned-fail.xml | 42 +++++++++++++++++++ tests/qemuxml2argvtest.c | 4 ++ tests/virpcimock.c | 9 ++-- 4 files changed, 65 insertions(+), 3 deletions(-) create mode 100644 tests/qemuxml2argvdata/hostdev-pci-multifunction-zero-unassigned-fail.xml diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 52b6c40c44..270c70eafb 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -15644,7 +15644,20 @@ virDomainHostdevDefParseXML(virDomainXMLOptionPtr xmlopt, | VIR_DOMAIN_DEF_PARSE_ALLOW_ROM) < 0) goto error; } + if (def->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS) { + /* Do not allow function 0 of a PCI multifunction device to + * be unassigned */ + if (virHostdevIsPCIMultifunctionDevice(def)) { + if (def->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED && + def->source.subsys.u.pci.addr.function == 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Function zero of PCI Multifunction device " + "must always be assigned")); + goto error; + } + } + switch ((virDomainHostdevSubsysType) def->source.subsys.type) { case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI: if (virXPathBoolean("boolean(./readonly)", ctxt)) diff --git a/tests/qemuxml2argvdata/hostdev-pci-multifunction-zero-unassigned-fail.xml b/tests/qemuxml2argvdata/hostdev-pci-multifunction-zero-unassigned-fail.xml new file mode 100644 index 0000000000..1abce3b2e7 --- /dev/null +++ b/tests/qemuxml2argvdata/hostdev-pci-multifunction-zero-unassigned-fail.xml @@ -0,0 +1,42 @@ +<domain type='kvm'> + <name>delete</name> + <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid> + <memory unit='KiB'>262144</memory> + <vcpu placement='static'>4</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='usb' index='0'/> + <controller type='ide' index='0'/> + <controller type='pci' index='0' model='pci-root'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/> + </source> + <address type='unassigned'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/> + </source> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/> + </source> + </hostdev> + </devices> +</domain> diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c index 5ee9c9b2dc..163664d913 100644 --- a/tests/qemuxml2argvtest.c +++ b/tests/qemuxml2argvtest.c @@ -1330,6 +1330,10 @@ mymain(void) QEMU_CAPS_KVM, QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST_PARSE_ERROR("hostdev-pci-multifunction-zero-unassigned-fail", + QEMU_CAPS_KVM, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST("serial-file-log", QEMU_CAPS_CHARDEV_FILE_APPEND, QEMU_CAPS_DEVICE_ISA_SERIAL, diff --git a/tests/virpcimock.c b/tests/virpcimock.c index cd6ae1cff6..7be133d7a7 100644 --- a/tests/virpcimock.c +++ b/tests/virpcimock.c @@ -1000,9 +1000,12 @@ init_env(void) MAKE_PCI_DEVICE("0001:01:00.1", 0x8086, 0x105e, 5); MAKE_PCI_DEVICE("0005:80:00.0", 0x10b5, 0x8112, 6, .klass = 0x060400); MAKE_PCI_DEVICE("0005:90:01.0", 0x1033, 0x0035, 7); - MAKE_PCI_DEVICE("0005:90:01.1", 0x1033, 0x0035, 7); - MAKE_PCI_DEVICE("0005:90:01.2", 0x1033, 0x00e0, 7); - MAKE_PCI_DEVICE("0005:90:01.3", 0x1033, 0x00e0, 7); + MAKE_PCI_DEVICE("0005:90:01.1", 0x1033, 0x0035, 7, + .physfn = "0005:90:01.0"); /* Virtual Function */ + MAKE_PCI_DEVICE("0005:90:01.2", 0x1033, 0x00e0, 7, + .physfn = "0005:90:01.0"); /* Virtual Function */ + MAKE_PCI_DEVICE("0005:90:01.3", 0x1033, 0x00e0, 7, + .physfn = "0005:90:01.0"); /* Virtual Function */ MAKE_PCI_DEVICE("0000:0a:01.0", 0x8086, 0x0047, 8); MAKE_PCI_DEVICE("0000:0a:02.0", 0x8286, 0x0048, 8); MAKE_PCI_DEVICE("0000:0a:03.0", 0x8386, 0x0048, 8); -- 2.23.0

On 11/26/19 8:33 AM, Daniel Henrique Barboza wrote:
changes from previous version [1]: - do not overload address type='none'. A new address type called 'unassigned' is introduced; - there is no unassigned' flag being created in virDomainHostdevDef. The logic added by new address type is enough; - do not allow function zero of multifunction devices to be unassigned.
Nothing too special to discuss in this cover letter. More details can be found at the discussions of the previous version [1]. Commit messages of the patches have more background info as well.
[1] https://www.redhat.com/archives/libvir-list/2019-November/msg01099.html
For 1-4: Reviewed-by: Cole Robinson <crobinso@redhat.com> You asked in the last posting whether to use ADDRESS_TYPE_NONE and add a new hostdev unassigned= attribute, or this approach. I think this is the correct approach. Address type='unassigned' captures the description well, and since address type='none' isn't printed in the XML, it would make XML output a bit more confusing IMO if no address was printed. One comment
Daniel Henrique Barboza (6): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line and alias
Is there a specific reason to skip alias assign, beside it not being needed for the command line? If it doesn't hurt, maybe we just keep it. I don't have a strong argument for it though. If no one says anything I'll leave it as is
virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/>
For the first 4 I'll give it a few days and push on Friday if no one complains. For the last two:
utils: PCI multifunction detection helpers domain_conf.c: don't allow function zero to be unassigned
The domain_conf.c additions should go into the virDomainHostdevDefValidate. But this validation check seems to actually read from host PCI space. I'm not sure if that's a good idea to do for every XML parse? What's the failure scenario without this error message? Does it fail in an obvious way? If so, maybe it's better to side step the issue and just let it fail if it's a valid configuration. Maybe laine knows better if there's precedent for similar checks at Validate time Thanks, Cole

On 12/11/19 8:49 PM, Cole Robinson wrote:
On 11/26/19 8:33 AM, Daniel Henrique Barboza wrote:
changes from previous version [1]: [...]
One comment
Daniel Henrique Barboza (6): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line and alias
Is there a specific reason to skip alias assign, beside it not being needed for the command line? If it doesn't hurt, maybe we just keep it. I don't have a strong argument for it though. If no one says anything I'll leave it as is
The reason why I was skipping aliases was cosmetic due to QEMU command line. I find it a bit odd that, in a scenario with 4 functions where function 1 is unassigned, QEMU cmd line would have hostdev0, hostdev2 and hostdev3, because 'hostdev1' alias got assigned to the unassigned function '1' I don't have any strong feelings about this though. We can keep giving aliases or all functions, regardless of whether they are being assigned to the guest or not. Unit tests will need to be adjusted though.
virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/>
For the first 4 I'll give it a few days and push on Friday if no one complains.
For the last two:
utils: PCI multifunction detection helpers domain_conf.c: don't allow function zero to be unassigned
The domain_conf.c additions should go into the virDomainHostdevDefValidate. But this validation check seems to actually read from host PCI space. I'm not sure if that's a good idea to do for every XML parse? What's the failure scenario without this error message? Does it fail in an obvious way? If so, maybe it's better to side step the issue and just let it fail if it's a valid configuration.
This is a strange scenario TBH. I discussed it a bit with Alex Williamson in the v2 of this series. At first there is nothing wrong with this, in fact QEMU allows it. Problem is that Power guests handle this case in a "better" way than x86 and others, making the non-zero function being visible and usable by the guest without any extra kernel option (x86 needs pci_scan_all). It's also hardware dependent - the BCM5719 network card I am using for testing deals with this scenario, but there's no guarantee that other cards will. To answer your question directly: this might not fail in an obvious way in the guest, but it's not like this feature is a default PCI assignment mode - the user have to deliberately unassigned the function 0 in the XML. Granted, I am being conservative with this extra check - we can simply let the user play with the configuration and, in case it blows up, the user can simply add function 0 back to the guest. If this turns out to be a "toxic" setup then I can go to QEMU and deal with it there - I mean, in the end it's QEMU that allows this, so makes sense to handle the case down there. Thanks, DHB
Maybe laine knows better if there's precedent for similar checks at Validate time
Thanks, Cole

On 12/12/19 4:41 PM, Daniel Henrique Barboza wrote:
On 12/11/19 8:49 PM, Cole Robinson wrote:
On 11/26/19 8:33 AM, Daniel Henrique Barboza wrote:
changes from previous version [1]: [...]
One comment
Daniel Henrique Barboza (6): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line and alias
Is there a specific reason to skip alias assign, beside it not being needed for the command line? If it doesn't hurt, maybe we just keep it. I don't have a strong argument for it though. If no one says anything I'll leave it as is
The reason why I was skipping aliases was cosmetic due to QEMU command line. I find it a bit odd that, in a scenario with 4 functions where function 1 is unassigned, QEMU cmd line would have hostdev0, hostdev2 and hostdev3, because 'hostdev1' alias got assigned to the unassigned function '1'
I don't have any strong feelings about this though. We can keep giving aliases or all functions, regardless of whether they are being assigned to the guest or not. Unit tests will need to be adjusted though.
virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/>
For the first 4 I'll give it a few days and push on Friday if no one complains.
For the last two:
utils: PCI multifunction detection helpers domain_conf.c: don't allow function zero to be unassigned
The domain_conf.c additions should go into the virDomainHostdevDefValidate. But this validation check seems to actually read from host PCI space. I'm not sure if that's a good idea to do for every XML parse? What's the failure scenario without this error message? Does it fail in an obvious way? If so, maybe it's better to side step the issue and just let it fail if it's a valid configuration.
This is a strange scenario TBH. I discussed it a bit with Alex Williamson in the v2 of this series. At first there is nothing wrong with this, in fact QEMU allows it. Problem is that Power guests handle this case in a "better" way than x86 and others, making the non-zero function being visible and usable by the guest without any extra kernel option (x86 needs pci_scan_all). It's also hardware dependent - the BCM5719 network card I am using for testing deals with this scenario, but there's no guarantee that other cards will.
To answer your question directly: this might not fail in an obvious way in the guest, but it's not like this feature is a default PCI assignment mode - the user have to deliberately unassigned the function 0 in the XML. Granted, I am being conservative with this extra check - we can simply let the user play with the configuration and, in case it blows up, the user can simply add function 0 back to the guest. If this turns out to be a "toxic" setup then I can go to QEMU and deal with it there - I mean, in the end it's QEMU that allows this, so makes sense to handle the case down there.
Okay I think we should drop the last two patches then. Small change of plan: Please respin this series with the alias changes dropped. Also in the docs patch, the version needs to be updated to 6.0.0, I missed that the first time. After that I will push Thanks, Cole
participants (2)
-
Cole Robinson
-
Daniel Henrique Barboza