[libvirt] [PATCH v4 0/5] PCI hostdev partial assignment support

changes from version 3 [1]:
- removed last 2 patches that made function 0 of PCI multifunction devices mandatory
- new patch: news.xml update
- changed 'since' version to 6.0.0 in patch 4
- unassigned hostdevs are now getting qemu aliases

[1] https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html

Daniel Henrique Barboza (5):
  Introducing new address type='unassigned' for PCI hostdevs
  qemu: handle unassigned PCI hostdevs in command line
  virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices
  formatdomain.html.in: document <address type='unassigned'/>
  news.xml: add address type='unassigned' entry

 docs/formatdomain.html.in                        | 14 ++++
 docs/news.xml                                    | 19 ++++++
 docs/schemas/domaincommon.rng                    |  5 ++
 src/conf/device_conf.c                           |  2 +
 src/conf/device_conf.h                           |  1 +
 src/conf/domain_conf.c                           |  7 +-
 src/qemu/qemu_command.c                          |  5 ++
 src/qemu/qemu_domain.c                           |  1 +
 src/qemu/qemu_domain_address.c                   |  5 ++
 src/util/virhostdev.c                            | 64 +++++++++++++++++--
 .../hostdev-pci-address-unassigned.args          | 31 +++++++++
 .../hostdev-pci-address-unassigned.xml           | 42 ++++++++++++
 tests/qemuxml2argvtest.c                         |  4 ++
 .../hostdev-pci-address-unassigned.xml           | 58 +++++++++++++++++
 tests/qemuxml2xmltest.c                          |  1 +
 15 files changed, 251 insertions(+), 8 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args
 create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml
 create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml

--
2.23.0
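As a quick sketch of the intended usage (hypothetical host addresses, not taken verbatim from the series): both functions of an IOMMU group are declared in the domain, and the one the guest should not see gets the new address type:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0001' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0001' bus='0x09' slot='0x00' function='0x1'/>
      </source>
      <address type='unassigned'/>
    </hostdev>

With managed='yes', both functions are detached from the host when the guest starts, but only function 0 is visible inside the guest.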

Today, to use a PCI hostdev "A" in a domain, all PCI devices that belong to the same IOMMU group must also be declared in the domain XML, meaning that all devices of the IOMMU group are detached from the host and all of them are visible to the guest. The result is that the guest will have access to all devices, but this is not necessarily what the administrator wanted. If only hostdev "A" was intended for guest usage, but hostdevs "B" and "C" happen to be in the same IOMMU group as "A", the guest will gain access to all 3 devices.

This makes the administrator rely on alternative solutions, such as using all hostdevs in un-managed mode and detaching all the IOMMU group devices before the guest starts. If un-managed mode is not an option, the only alternative left is an ACS patch to deny guest access to "B" and "C".

This patch introduces a new address type called "unassigned" to handle the situation where a hostdev is owned by a domain, but not visible to the guest OS. This allows the administrator to declare all the devices of the IOMMU group while also choosing which hostdevs will be usable by the guest. This new mechanism applies to all PCI hostdevs, regardless of whether they are PCI multifunction hostdevs or not. Using <address type='unassigned'/> in any case other than a PCI hostdev will result in an error.

The next patch will use this new address type in the QEMU driver to avoid adding unassigned devices to the QEMU launch command line.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 docs/schemas/domaincommon.rng                    |  5 ++
 src/conf/device_conf.c                           |  2 +
 src/conf/device_conf.h                           |  1 +
 src/conf/domain_conf.c                           |  7 ++-
 src/qemu/qemu_command.c                          |  1 +
 src/qemu/qemu_domain.c                           |  1 +
 src/qemu/qemu_domain_address.c                   |  5 ++
 .../hostdev-pci-address-unassigned.xml           | 42 ++++++++++++
 .../hostdev-pci-address-unassigned.xml           | 58 +++++++++++++++++++
 tests/qemuxml2xmltest.c                          |  1 +
 10 files changed, 122 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml
 create mode 100644 tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml

diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index f5b51d20ad..372d248f25 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -5470,6 +5470,11 @@
           </attribute>
           <ref name="dimmaddress"/>
         </group>
+        <group>
+          <attribute name="type">
+            <value>unassigned</value>
+          </attribute>
+        </group>
       </choice>
     </element>
   </define>
diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c
index 4c57f0995f..4dbd5c1ac9 100644
--- a/src/conf/device_conf.c
+++ b/src/conf/device_conf.c
@@ -45,6 +45,7 @@ VIR_ENUM_IMPL(virDomainDeviceAddress,
               "virtio-mmio",
               "isa",
               "dimm",
+              "unassigned",
 );

 static int
@@ -120,6 +121,7 @@ virDomainDeviceInfoAddressIsEqual(const virDomainDeviceInfo *a,
     /* address types below don't have any specific data */
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
         break;

     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI:
diff --git a/src/conf/device_conf.h b/src/conf/device_conf.h
index d98fae775c..e091d7cfe2 100644
--- a/src/conf/device_conf.h
+++ b/src/conf/device_conf.h
@@ -45,6 +45,7 @@ typedef enum {
     VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO,
     VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA,
     VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM,
+    VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED,

     VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST
 } virDomainDeviceAddressType;
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index cfec86a24d..454eb63be6 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -6338,9 +6338,11 @@ virDomainHostdevDefValidate(const virDomainHostdevDef *hostdev)
     switch ((virDomainHostdevSubsysType) hostdev->source.subsys.type) {
     case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
         if (hostdev->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE &&
+            hostdev->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED &&
             hostdev->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI) {
             virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
-                           _("PCI host devices must use 'pci' address type"));
+                           _("PCI host devices must use 'pci' or "
+                             "'unassigned' address type"));
             return -1;
         }
         break;
@@ -7357,6 +7359,7 @@ virDomainDeviceInfoFormat(virBufferPtr buf,

     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST:
         break;
     }
@@ -7557,6 +7560,7 @@ virDomainDeviceAddressParseXML(xmlNodePtr address,
         break;

     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST:
         break;
     }
@@ -21871,6 +21875,7 @@ virDomainDeviceInfoCheckABIStability(virDomainDeviceInfoPtr src,
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCW:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST:
         break;
     }
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 49a0dad8d4..5d12ef3d63 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -461,6 +461,7 @@ qemuBuildVirtioDevStr(virBufferPtr buf,
         return -1;

     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST:
     default:
         virReportEnumRangeError(virDomainDeviceAddressType, info->type);
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index df21065e84..b3df2c4a7a 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -7456,6 +7456,7 @@ qemuDomainDeviceDefValidateAddress(const virDomainDeviceDef *dev,
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA:
     case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
         /* No validation for these address types yet */
         break;

diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c
index 605984f80f..b077240899 100644
--- a/src/qemu/qemu_domain_address.c
+++ b/src/qemu/qemu_domain_address.c
@@ -2313,6 +2313,11 @@ qemuDomainAssignDevicePCISlots(virDomainDefPtr def,
             continue;
         }

+        /* do not reserve address for info->type='unassigned' */
+        if (def->hostdevs[i]->info->type ==
+            VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED)
+            continue;
+
         if (qemuDomainPCIAddressReserveNextAddr(addrs,
                                                 def->hostdevs[i]->info) < 0)
             goto error;
diff --git a/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml
new file mode 100644
index 0000000000..9a2685ca0e
--- /dev/null
+++ b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.xml
@@ -0,0 +1,42 @@
+<domain type='kvm'>
+  <name>delete</name>
+  <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid>
+  <memory unit='KiB'>262144</memory>
+  <vcpu placement='static'>4</vcpu>
+  <os>
+    <type arch='x86_64' machine='pc'>hvm</type>
+  </os>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <controller type='usb' index='0'/>
+    <controller type='ide' index='0'/>
+    <controller type='pci' index='0' model='pci-root'/>
+    <input type='mouse' bus='ps2'/>
+    <input type='keyboard' bus='ps2'/>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/>
+      </source>
+    </hostdev>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/>
+      </source>
+      <address type='unassigned'/>
+    </hostdev>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/>
+      </source>
+    </hostdev>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/>
+      </source>
+    </hostdev>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml b/tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml
new file mode 100644
index 0000000000..2341e8432b
--- /dev/null
+++ b/tests/qemuxml2xmloutdata/hostdev-pci-address-unassigned.xml
@@ -0,0 +1,58 @@
+<domain type='kvm'>
+  <name>delete</name>
+  <uuid>583a8e8e-f0ce-4f53-89ab-092862148b25</uuid>
+  <memory unit='KiB'>262144</memory>
+  <currentMemory unit='KiB'>262144</currentMemory>
+  <vcpu placement='static'>4</vcpu>
+  <os>
+    <type arch='x86_64' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <controller type='usb' index='0'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
+    </controller>
+    <controller type='ide' index='0'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
+    </controller>
+    <controller type='pci' index='0' model='pci-root'/>
+    <input type='mouse' bus='ps2'/>
+    <input type='keyboard' bus='ps2'/>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x0'/>
+      </source>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
+    </hostdev>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x1'/>
+      </source>
+      <address type='unassigned'/>
+    </hostdev>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x2'/>
+      </source>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
+    </hostdev>
+    <hostdev mode='subsystem' type='pci' managed='yes'>
+      <driver name='vfio'/>
+      <source>
+        <address domain='0x0005' bus='0x90' slot='0x01' function='0x3'/>
+      </source>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
+    </hostdev>
+    <memballoon model='virtio'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
+    </memballoon>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c
index 50b43e01d4..c02791c660 100644
--- a/tests/qemuxml2xmltest.c
+++ b/tests/qemuxml2xmltest.c
@@ -431,6 +431,7 @@ mymain(void)

     DO_TEST("hostdev-usb-address", NONE);
     DO_TEST("hostdev-pci-address", NONE);
+    DO_TEST("hostdev-pci-address-unassigned", NONE);
     DO_TEST("hostdev-pci-multifunction", NONE);
     DO_TEST("hostdev-vfio", NONE);
     DO_TEST("hostdev-vfio-zpci",

--
2.23.0

The previous patch made it possible for the QEMU driver to check whether a given PCI hostdev is unassigned, by checking if dev->info->type is VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED, meaning that the device shouldn't be part of the actual guest launch.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_command.c                          |  4 +++
 .../hostdev-pci-address-unassigned.args          | 31 +++++++++++++++++++
 tests/qemuxml2argvtest.c                         |  4 +++
 3 files changed, 39 insertions(+)
 create mode 100644 tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 5d12ef3d63..7f7331744f 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -5419,6 +5419,10 @@ qemuBuildHostdevCommandLine(virCommandPtr cmd,
             }
         }

+        /* Ignore unassigned devices */
+        if (hostdev->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED)
+            continue;
+
         unsigned int bootIndex = hostdev->info->bootIndex;

         /* bootNet will be non-0 if boot order was set and no other
diff --git a/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args
new file mode 100644
index 0000000000..42fae17444
--- /dev/null
+++ b/tests/qemuxml2argvdata/hostdev-pci-address-unassigned.args
@@ -0,0 +1,31 @@
+LC_ALL=C \
+PATH=/bin \
+HOME=/tmp/lib/domain--1-delete \
+USER=test \
+LOGNAME=test \
+XDG_DATA_HOME=/tmp/lib/domain--1-delete/.local/share \
+XDG_CACHE_HOME=/tmp/lib/domain--1-delete/.cache \
+XDG_CONFIG_HOME=/tmp/lib/domain--1-delete/.config \
+QEMU_AUDIO_DRV=none \
+/usr/bin/qemu-system-x86_64 \
+-name delete \
+-S \
+-machine pc,accel=kvm,usb=off,dump-guest-core=off \
+-m 256 \
+-realtime mlock=off \
+-smp 4,sockets=4,cores=1,threads=1 \
+-uuid 583a8e8e-f0ce-4f53-89ab-092862148b25 \
+-display none \
+-no-user-config \
+-nodefaults \
+-chardev socket,id=charmonitor,path=/tmp/lib/domain--1-delete/monitor.sock,\
+server,nowait \
+-mon chardev=charmonitor,id=monitor,mode=control \
+-rtc base=utc \
+-no-shutdown \
+-no-acpi \
+-usb \
+-device vfio-pci,host=0005:90:01.0,id=hostdev0,bus=pci.0,addr=0x3 \
+-device vfio-pci,host=0005:90:01.2,id=hostdev2,bus=pci.0,addr=0x4 \
+-device vfio-pci,host=0005:90:01.3,id=hostdev3,bus=pci.0,addr=0x5 \
+-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index 8655db609e..92fbf3764e 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -1331,6 +1331,10 @@ mymain(void)
             QEMU_CAPS_KVM,
             QEMU_CAPS_DEVICE_VFIO_PCI);

+    DO_TEST("hostdev-pci-address-unassigned",
+            QEMU_CAPS_KVM,
+            QEMU_CAPS_DEVICE_VFIO_PCI);
+
     DO_TEST("serial-file-log",
             QEMU_CAPS_CHARDEV_FILE_APPEND,
             QEMU_CAPS_DEVICE_ISA_SERIAL,

--
2.23.0

virHostdevPreparePCIDevices verifies if a PCI hostdev "A" is in use by another domain by checking the 'activePCIHostdevs' list. It also verifies if other domains are using any other hostdev that belongs to the same IOMMU group as "A". This is not enough to cover all the cases.

Suppose a PCI hostdev "B" that shares the IOMMU group with "A". We are currently able to abort guest launch if "B" is being used by another domain, but if "B" is not being used by any domain, we'll proceed with guest execution. What happens then is: if "A" is being used in managed mode, the guest will fail to launch because Libvirt didn't detach "B" from the host. If "A" is being used in un-managed mode, the guest might fail or succeed to launch depending on whether both "A" and "B" were detached manually prior to guest start.

Libvirt is now able to deal with these scenarios in a homogeneous fashion, providing the same behavior for both managed and un-managed mode while allowing partial assignment in both cases as well. This patch changes virHostdevPreparePCIDevices to check all the PCI hostdevs of the domain against all the PCI hostdevs of the IOMMU group, failing to launch if the domain does not declare all of them, in both the managed and un-managed cases.

After this patch, any existing domains that were using PCI hostdevs in un-managed mode, and were getting away with partial assignment without declaring all the IOMMU group PCI hostdevs in the domain XML, will be forced to do so. The missing PCI hostdevs can be declared with "<address type='unassigned'/>" to keep them invisible to the guest, making the guest oblivious to this change. This is an annoyance for such domains, but the payoff is more consistent behavior between managed and un-managed mode and more clarity in the domain XML, forcing the admin to declare all PCI hostdevs of the IOMMU group and know exactly what will be detached from the host.

This is an example of the newly added error condition when the domain does not declare all the PCI hostdevs of the same IOMMU group:

$ sudo ./run tools/virsh start multipci-coldplug-partial-fail
error: Failed to start domain multipci-coldplug-partial-fail
error: Requested operation is not valid: All devices of the same IOMMU group 1 of the PCI device 0001:09:00.0 must belong to domain multipci-coldplug-partial-fail
$

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/util/virhostdev.c | 64 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 57 insertions(+), 7 deletions(-)

diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c
index 39e6b8f49f..2e7885112f 100644
--- a/src/util/virhostdev.c
+++ b/src/util/virhostdev.c
@@ -56,6 +56,13 @@ struct virHostdevIsPCINodeDeviceUsedData {
     bool usesVFIO;
 };

+struct virHostdevIsAllIOMMUGroupUsedData {
+    virPCIDeviceListPtr pcidevs;
+    const char *domainName;
+    const char *deviceName;
+    int iommuGroup;
+};
+
 /* This module makes heavy use of bookkeeping lists contained inside a
  * virHostdevManager instance to keep track of the devices' status. To make
  * it easy to spot potential ownership errors when moving devices from one
@@ -111,6 +118,27 @@ static int virHostdevIsPCINodeDeviceUsed(virPCIDeviceAddressPtr devAddr, void *o
     return 0;
 }

+static int virHostdevIsAllIOMMUGroupUsed(virPCIDeviceAddressPtr devAddr, void *opaque)
+{
+    struct virHostdevIsAllIOMMUGroupUsedData *helperData = opaque;
+    virPCIDevicePtr actual;
+
+    actual = virPCIDeviceListFindByIDs(helperData->pcidevs,
+                                       devAddr->domain, devAddr->bus,
+                                       devAddr->slot, devAddr->function);
+    if (actual) {
+        return 0;
+    } else {
+        virReportError(VIR_ERR_OPERATION_INVALID,
+                       _("All devices of the same IOMMU group %d of "
+                         "the PCI device %s must belong to domain %s"),
+                       helperData->iommuGroup,
+                       helperData->deviceName,
+                       helperData->domainName);
+        return -1;
+    }
+}
+
 static int virHostdevManagerOnceInit(void)
 {
     if (!VIR_CLASS_NEW(virHostdevManager, virClassForObject()))
@@ -724,9 +752,10 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr,
                             unsigned int flags)
 {
     g_autoptr(virPCIDeviceList) pcidevs = NULL;
+    g_autofree unsigned int *searchedIOMMUs = NULL;
     int last_processed_hostdev_vf = -1;
-    size_t i;
-    int ret = -1;
+    size_t i, j;
+    int ret = -1, nSearchedIOMMUs = 0;
     virPCIDeviceAddressPtr devAddr = NULL;

     if (!nhostdevs)
@@ -735,6 +764,8 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr,
     if (!(pcidevs = virHostdevGetPCIHostDeviceList(hostdevs, nhostdevs)))
         return -1;

+    searchedIOMMUs = g_new0(unsigned int, virPCIDeviceListCount(pcidevs));
+
     virObjectLock(mgr->activePCIHostdevs);
     virObjectLock(mgr->inactivePCIHostdevs);

@@ -778,13 +809,32 @@ virHostdevPreparePCIDevices(virHostdevManagerPtr mgr,
             goto cleanup;

         /* VFIO devices belonging to same IOMMU group can't be
-         * shared across guests. Check if that's the case. */
+         * shared across guests, and all of them must belong to the
+         * same domain. If a device isn't going to be assigned
+         * (flag unassigned is true) the guest will not use it, but
+         * the actual device must be detached from the host
+         * anyway. */
         if (usesVFIO) {
-            data.usesVFIO = true;
-            if (virPCIDeviceAddressIOMMUGroupIterate(devAddr,
-                                                     virHostdevIsPCINodeDeviceUsed,
-                                                     &data) < 0)
+            int devIOMMUGroup = virPCIDeviceAddressGetIOMMUGroupNum(devAddr);
+            struct virHostdevIsAllIOMMUGroupUsedData helper = {
+                pcidevs, dom_name, virPCIDeviceGetName(pci), devIOMMUGroup};
+            bool alreadySearched = false;
+
+            for (j = 0; j < nSearchedIOMMUs; j++) {
+                if (devIOMMUGroup == searchedIOMMUs[j]) {
+                    alreadySearched = true;
+                    break;
+                }
+            }
+
+            if (alreadySearched)
+                continue;
+
+            if (virPCIDeviceAddressIOMMUGroupIterate(
+                    devAddr, virHostdevIsAllIOMMUGroupUsed, &helper) < 0)
                 goto cleanup;
+
+            searchedIOMMUs[nSearchedIOMMUs++] = devIOMMUGroup;
         }
     }

--
2.23.0

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 docs/formatdomain.html.in | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index bfcdc026e6..281de89b10 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -4152,6 +4152,20 @@
         attributes: <code>iobase</code> and <code>irq</code>.
         <span class="since">Since 1.2.1</span>
       </dd>
+      <dt><code>unassigned</code></dt>
+      <dd>For PCI hostdevs, <code><address type='unassigned'/></code>
+        has a special meaning. A PCI hostdev must be assigned to the domain
+        together with all the devices of its IOMMU group. If the admin so
+        chooses, it is possible to filter which devices the guest OS will
+        actually see, while the Libvirt domain retains ownership of all the
+        devices of the IOMMU group. Using this address type in such cases
+        tells Libvirt that the device must not be assigned to the guest.
+        This avoids scenarios in which the admin might be compelled to use
+        an ACS patch to remove the device from the guest.
+        <code><address type='unassigned'/></code> is an invalid address
+        type for all other device types.
+        <span class="since">Since 6.0.0</span>
+      </dd>
     </dl>

     <h4><a id="elementsVirtio">Virtio-related options</a></h4>

--
2.23.0

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 docs/news.xml | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/docs/news.xml b/docs/news.xml
index 2a25b6ca49..e77cd743c8 100644
--- a/docs/news.xml
+++ b/docs/news.xml
@@ -54,6 +54,25 @@
           written in the RST as an alternative to HTML.
         </description>
       </change>
+      <change>
+        <summary>
+          new PCI hostdev address type: unassigned
+        </summary>
+        <description>
+          A new PCI hostdev address type 'unassigned' is introduced,
+          giving users the option to choose which PCI hostdevs
+          within the same IOMMU group will not be assigned to the
+          guest. After this change, all PCI hostdevs of the same IOMMU
+          group must be declared in the domain XML. Hostdevs that
+          shouldn't be used by the guest can be classified as
+          <code>address type='unassigned'</code>. Forcing all IOMMU
+          group hostdevs to be declared makes the user aware of what
+          will be detached from the host, and this new address type
+          gives the user an extra layer of access control over
+          hardware resources, denying guest access to PCI hostdevs
+          that it shouldn't have access to.
+        </description>
+      </change>
     </section>
     <section title="Improvements">
      <change>

--
2.23.0

On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
changes from version 3 [1]: - removed last 2 patches that made function 0 of PCI multifunction devices mandatory - new patch: news.xml update - changed 'since' version to 6.0.0 in patch 4 - unassigned hostdevs are now getting qemu aliases
[1] https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html
Daniel Henrique Barboza (5): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/> news.xml: add address type='unassigned' entry
Codewise it looks fine now. But I'm looking more closely at patch #3 and realizing that it can explicitly reject a previously accepted VM config. And indeed, now that I give it a test with my GPU passthrough setup, it is rejecting my previously working config:

error: Requested operation is not valid: All devices of the same IOMMU group 1 of the PCI device 0000:01:00.0 must belong to domain win10

I've attached the nodedev XML for the three devices with iommuGroup 1. Only the two nvidia devices are assigned to my VM, but not the PCIe controller device. Is the libvirt heuristic missing something? Or is this acting as expected?

I didn't quite gather that this is a change to reject previously accepted configurations, so I will defer to Laine and Alex as to whether this should be committed.

- Cole
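(For reference, libvirt's nodedev XML reports IOMMU group membership under the PCI capability, so each of the three devices Cole mentions would carry an <iommuGroup> element roughly like the sketch below; the addresses shown are illustrative, not taken from Cole's actual attachment.)

    <device>
      <name>pci_0000_01_00_0</name>
      <capability type='pci'>
        ...
        <iommuGroup number='1'>
          <address domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
          <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
          <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
        </iommuGroup>
      </capability>
    </device>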

On 12/16/19 7:28 PM, Cole Robinson wrote:
On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
changes from version 3 [1]: - removed last 2 patches that made function 0 of PCI multifunction devices mandatory - new patch: news.xml update - changed 'since' version to 6.0.0 in patch 4 - unassigned hostdevs are now getting qemu aliases
[1] https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html
Daniel Henrique Barboza (5): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/> news.xml: add address type='unassigned' entry
Codewise it looks fine now. But I'm looking more closely at patch #3 and realizing that it can explicitly reject a previously accepted VM config. And indeed, now that I give it a test with my GPU passthrough setup, it is rejecting my previosly working config.
error: Requested operation is not valid: All devices of the same IOMMU group 1 of the PCI device 0000:01:00.0 must belong to domain win10
I've attached the nodedev XML for the three devices with iommuGroup 1. Only the two nvidia devices are assigned to my VM, but not the PCIe controller device.
Is the libvirt heuristic missing something? Or is this acting as expected?
You mentioned that you declared 3 devices of IOMMU group 1. Unless the code in patch 3 has a bug, there are more PCI hostdevs in IOMMU group 1 that were left out of the domain XML.
I didn't quite gather that this is a change to reject previously accepted configurations, so I will defer to Laine and Alex as to whether this should be committed.
I mentioned in the commit msg of patch 03 that this would break working configurations that didn't comply with the new 'all devices of the IOMMU group must be included in the domain XML' directive. Perhaps this is worth mentioning in the 'news' page to warn users about it.

About breaking existing configurations, there is the possibility of not going forward with patch 03, which enforces this rule of declaring all the devices of the IOMMU group. Existing domains will keep working as usual, the option to unassign devices will still be present, but the user will have to deal with the potential QEMU errors if not all PCI devices were detached from the host.

In this case, the 'unassigned' type becomes more of an ON/OFF switch to add/remove the PCI hostdev from the guest without removing it from the domain XML. It is still useful, but we lose the idea of all the IOMMU group devices being described in the domain XML, which is something Laine mentioned would be desirable in one of the RFCs.

Thanks, DHB
- Cole

On 12/16/19 6:03 PM, Daniel Henrique Barboza wrote:
On 12/16/19 7:28 PM, Cole Robinson wrote:
On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
changes from version 3 [1]: - removed last 2 patches that made function 0 of PCI multifunction devices mandatory - new patch: news.xml update - changed 'since' version to 6.0.0 in patch 4 - unassigned hostdevs are now getting qemu aliases
[1] https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html
Daniel Henrique Barboza (5): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/> news.xml: add address type='unassigned' entry
Codewise it looks fine now. But I'm looking more closely at patch #3 and realizing that it can explicitly reject a previously accepted VM config. And indeed, now that I give it a test with my GPU passthrough setup, it is rejecting my previosly working config.
error: Requested operation is not valid: All devices of the same IOMMU group 1 of the PCI device 0000:01:00.0 must belong to domain win10
I've attached the nodedev XML for the three devices with iommuGroup 1. Only the two nvidia devices are assigned to my VM, but not the PCIe controller device.
Is the libvirt heuristic missing something? Or is this acting as expected?
You mentioned that you declared 3 devices of IOMMU group 1. Unless the code in patch 3 has a bug, there are more PCI hostdevs in IOMMU group 1 that were left out of the domain XML.
I didn't quite gather that this is a change to reject previously accepted configurations, so I will defer to Laine and Alex as to whether this should be committed.
I mentioned in the commit msg of patch 03 that this would break working configurations that didn't comply with the new 'all devices of the IOMMU group must be included in the domain XML' directive. Perhaps this is worth mentioning in the 'news' page to warn users about it.
No, this shouldn't be a requirement at all. In my mind the purpose of these patches is to make something work (in a safe manner) that failed before, *not* to add new restrictions that break things that already work. (Sorry I wasn't paying more attention to the patches earlier).
About breaking existing configurations, there is the possibility of not going forward with patch 03, which is enforcing this rule of declaring all the IOMMU group. Existing domains will keep working as usual, the option to unassign devices will still be present, but the user will have to deal with the potential QEMU errors if not all PCI devices were detached from the host.
In this case, the 'unassigned' type will become more of a ON/OFF switch to add/remove the PCI hostdev from the guest without removing it from the domain XML. It is still useful, but we lose the idea of all the IOMMU devices being described in the domain XML, which is something Laine mentioned it would be desirable in one of the RFCs.
I don't actually recall saying that :-). I haven't looked in the list archives, but what I *can* imagine myself saying is that only devices mentioned in the XML should be manipulated in any way by libvirt. So, for example, you shouldn't unbind device X from its host driver if there is nothing in the XML telling you to do that. But if a device isn't mentioned in the XML, and is already bound to some driver that is acceptable to the VFIO subsystem (e.g. vfio-pci, pci-stub or no driver at all (? is that right Alex?)) then that should not create any problem. Doing otherwise would break too many existing configs. (For example, my own assigned-GPU config, which assumes that all the devices are already bound to the proper driver, and uses "managed='no'")

On Tue, 17 Dec 2019 11:25:38 -0500 Laine Stump <laine@laine.org> wrote:
On 12/16/19 6:03 PM, Daniel Henrique Barboza wrote:
On 12/16/19 7:28 PM, Cole Robinson wrote:
On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
changes from version 3 [1]: - removed last 2 patches that made function 0 of PCI multifunction devices mandatory - new patch: news.xml update - changed 'since' version to 6.0.0 in patch 4 - unassigned hostdevs are now getting qemu aliases
[1] https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html
Daniel Henrique Barboza (5): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/> news.xml: add address type='unassigned' entry
Codewise it looks fine now. But I'm looking more closely at patch #3 and realizing that it can explicitly reject a previously accepted VM config. And indeed, now that I give it a test with my GPU passthrough setup, it is rejecting my previosly working config.
error: Requested operation is not valid: All devices of the same IOMMU group 1 of the PCI device 0000:01:00.0 must belong to domain win10
I've attached the nodedev XML for the three devices with iommuGroup 1. Only the two nvidia devices are assigned to my VM, but not the PCIe controller device.
Is the libvirt heuristic missing something? Or is this acting as expected?
You mentioned that you declared 3 devices of IOMMU group 1. Unless the code in patch 3 has a bug, there are more PCI hostdevs in IOMMU group 1 that were left out of the domain XML.
I didn't quite gather that this is a change to reject previously accepted configurations, so I will defer to Laine and Alex as to whether this should be committed.
I mentioned in the commit msg of patch 03 that this would break working configurations that didn't comply with the new 'all devices of the IOMMU group must be included in the domain XML' directive. Perhaps this is worth mentioning in the 'news' page to warn users about it.
No, this shouldn't be a requirement at all. In my mind the purpose of these patches is to make something work (in a safe manner) that failed before, *not* to add new restrictions that break things that already work. (Sorry I wasn't paying more attention to the patches earlier).
+1
About breaking existing configurations, there is the possibility of not going forward with patch 03, which is enforcing this rule of declaring all the IOMMU group. Existing domains will keep working as usual, the option to unassign devices will still be present, but the user will have to deal with the potential QEMU errors if not all PCI devices were detached from the host.
In this case, the 'unassigned' type will become more of a ON/OFF switch to add/remove the PCI hostdev from the guest without removing it from the domain XML. It is still useful, but we lose the idea of all the IOMMU devices being described in the domain XML, which is something Laine mentioned it would be desirable in one of the RFCs.
I don't actually recall saying that :-). I haven't looked in the list archives, but what I *can* imagine myself saying is that only devices mentioned in the XML should be manipulated in any way by libvirt. So,
+1
for example, you shouldn't unbind device X from its host driver if there is nothing in the XML telling you to do that. But if a device isn't mentioned in the XML, and is already bound to some driver that is acceptable to the VFIO subsystem (e.g. vfio-pci, pci-stub or no driver at all (? is that right Alex?)) then that should not create any problem.
Yes, that's right.
Doing otherwise would break too many existing configs. (For example, my own assigned-GPU config, which assumes that all the devices are already bound to the proper driver, and uses "managed='no'")
Effectively anyone assigning PFs with a PCIe root port that does not support ACS would be broken by this series. That's a significant portion of vfio users. Thanks, Alex

On 12/17/19 1:32 PM, Alex Williamson wrote:
On Tue, 17 Dec 2019 11:25:38 -0500 Laine Stump <laine@laine.org> wrote:
On 12/16/19 6:03 PM, Daniel Henrique Barboza wrote:
About breaking existing configurations, there is the possibility of not going forward with patch 03, which is enforcing this rule of declaring all the IOMMU group. Existing domains will keep working as usual, the option to unassign devices will still be present, but the user will have to deal with the potential QEMU errors if not all PCI devices were detached from the host.
In this case, the 'unassigned' type will become more of a ON/OFF switch to add/remove the PCI hostdev from the guest without removing it from the domain XML. It is still useful, but we lose the idea of all the IOMMU devices being described in the domain XML, which is something Laine mentioned it would be desirable in one of the RFCs.
I don't actually recall saying that :-). I haven't looked in the list archives, but what I *can* imagine myself saying is that only devices mentioned in the XML should be manipulated in any way by libvirt. So,
+1
for example, you shouldn't unbind device X from its host driver if there is nothing in the XML telling you to do that. But if a device isn't mentioned in the XML, and is already bound to some driver that is acceptable to the VFIO subsystem (e.g. vfio-pci, pci-stub or no driver at all (? is that right Alex?)) then that should not create any problem.
Yes, that's right.
Doing otherwise would break too many existing configs. (For example, my own assigned-GPU config, which assumes that all the devices are already bound to the proper driver, and uses "managed='no'")
This is the new approach of the series I implemented today and plan to send for review today/tomorrow. I realized, after all the discussions yesterday with Alex, that the series would be better off just sticking with what we want fixed (managed=yes and partial assignment) and leaving unmanaged (managed=no) configurations alone.

If the user has an existing, working unmanaged setup, it means that the user chose to manage device detach/re-attach manually and shouldn't be bothered with a change that's aimed at managed devices.

Thanks, DHB
Effectively anyone assigning PFs with a PCIe root port that does not support ACS would be broken by this series. That's a significant portion of vfio users. Thanks,
Alex

On Tue, 17 Dec 2019 13:43:14 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
On 12/17/19 1:32 PM, Alex Williamson wrote:
On Tue, 17 Dec 2019 11:25:38 -0500 Laine Stump <laine@laine.org> wrote:
On 12/16/19 6:03 PM, Daniel Henrique Barboza wrote:
About breaking existing configurations, there is the possibility of not going forward with patch 03, which is enforcing this rule of declaring all the IOMMU group. Existing domains will keep working as usual, the option to unassign devices will still be present, but the user will have to deal with the potential QEMU errors if not all PCI devices were detached from the host.
In this case, the 'unassigned' type will become more of a ON/OFF switch to add/remove the PCI hostdev from the guest without removing it from the domain XML. It is still useful, but we lose the idea of all the IOMMU devices being described in the domain XML, which is something Laine mentioned it would be desirable in one of the RFCs.
I don't actually recall saying that :-). I haven't looked in the list archives, but what I *can* imagine myself saying is that only devices mentioned in the XML should be manipulated in any way by libvirt. So,
+1
for example, you shouldn't unbind device X from its host driver if there is nothing in the XML telling you to do that. But if a device isn't mentioned in the XML, and is already bound to some driver that is acceptable to the VFIO subsystem (e.g. vfio-pci, pci-stub or no driver at all (? is that right Alex?)) then that should not create any problem.
Yes, that's right.
Doing otherwise would break too many existing configs. (For example, my own assigned-GPU config, which assumes that all the devices are already bound to the proper driver, and uses "managed='no'")
This is the new approach of the series I implemented today and plan to to send for review today/tomorrow. I realized, after all the discussions yesterday with Alex, that the patch series would be best just sticking with what we want fixed (managed=yes and parcial assignment) and leaving unmanaged (managed=no) configurations alone. If the user has an existing, working unmanaged setup, this means that the user chose to manage device detach/re-attach manually and shouldn't be bothered with a change that's aimed to managed devices.
There are of course existing, working managed=yes configurations where the set of assigned devices is only a subset of the IOMMU group, and the user has configured other means to make the group viable relative to vfio. The statement above doesn't convince me that the next iteration isn't simply going to restrict its manipulation of other devices. As Laine says above and I said in my previous reply, libvirt should not manipulate the driver binding of any devices not explicitly listed in the domain XML. Thanks, Alex

On 12/17/19 1:43 PM, Daniel Henrique Barboza wrote:
I don't actually recall saying that :-). I haven't looked in the list archives, but what I *can* imagine myself saying is that only devices mentioned in the XML should be manipulated in any way by libvirt. So,
+1
for example, you shouldn't unbind device X from its host driver if there is nothing in the XML telling you to do that. But if a device isn't mentioned in the XML, and is already bound to some driver that is acceptable to the VFIO subsystem (e.g. vfio-pci, pci-stub or no driver at all (? is that right Alex?)) then that should not create any problem.
Yes, that's right.
Doing otherwise would break too many existing configs. (For example, my own assigned-GPU config, which assumes that all the devices are already bound to the proper driver, and uses "managed='no'")
I am re-reading this info and reassessing what I intended to do.

Suppose a scenario in which a host IOMMU group has PCI devices A, B and C. Let's also suppose that all of them are already bound to vfio-pci. If I create a guest with A and B as PCI hostdevs, with managed=yes, using vfio-pci, the setup would work as it is. If I add the IOMMU validation I was planning to, it will stop working because "C" isn't described in the domain XML.

If we stick with the directive "Libvirt shouldn't tamper with devices that are not described in the domain XML" that Laine mentioned up there, and this directive alone, then I can't make such assumptions about whether the user will be OK with assigning everything to vfio-pci, given that there are other acceptable alternatives for the VFIO subsystem that aren't restricted to that.

I'm convinced that the validation I was pursuing here is a bad idea. It has great potential for being annoying, makes assumptions about real-life configurations that aren't true, and offers not much in return. I'll remove it from the series. The result is that the user will have a new option to let Libvirt control all the IOMMU group devices while making some of them unassignable to the guest. But it will be a new option, not a new rule that all existing domains will need to adhere to.

Thanks, DHB

On Mon, 16 Dec 2019 17:28:28 -0500 Cole Robinson <crobinso@redhat.com> wrote:
On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
changes from version 3 [1]: - removed last 2 patches that made function 0 of PCI multifunction devices mandatory - new patch: news.xml update - changed 'since' version to 6.0.0 in patch 4 - unassigned hostdevs are now getting qemu aliases
[1] https://www.redhat.com/archives/libvir-list/2019-November/msg01263.html
Daniel Henrique Barboza (5): Introducing new address type='unassigned' for PCI hostdevs qemu: handle unassigned PCI hostdevs in command line virhostdev.c: check all IOMMU devs in virHostdevPreparePCIDevices formatdomain.html.in: document <address type='unassigned'/> news.xml: add address type='unassigned' entry
Codewise it looks fine now. But I'm looking more closely at patch #3 and realizing that it can explicitly reject a previously accepted VM config. And indeed, now that I give it a test with my GPU passthrough setup, it is rejecting my previosly working config.
error: Requested operation is not valid: All devices of the same IOMMU group 1 of the PCI device 0000:01:00.0 must belong to domain win10
I've attached the nodedev XML for the three devices with iommuGroup 1. Only the two nvidia devices are assigned to my VM, but not the PCIe controller device.
Is the libvirt heuristic missing something? Or is this acting as expected?
I didn't quite gather that this is a change to reject previously accepted configurations, so I will defer to Laine and Alex as to whether this should be committed.
Thanks for catching this! PCIe root ports or bridges being part of an IOMMU group is part of the nature of the grouping. However, only endpoint devices can be bound to vfio-pci and thus participate in this "partial assignment". If the code is trying to force all other devices in the IOMMU group that aren't assigned into this partial assignment mode, that's just fundamentally broken. Thanks, Alex

On 12/16/19 8:06 PM, Alex Williamson wrote:
On Mon, 16 Dec 2019 17:28:28 -0500 Cole Robinson <crobinso@redhat.com> wrote:
On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
changes from version 3 [1]:
Thanks for catching this! PCIe root ports or bridges being part of an IOMMU group is part of the nature of the grouping. However, only endpoint devices can be bound to vfio-pci and thus participate in this "partial assignment". If the code is trying to force all other devices in the IOMMU group that aren't assigned into this partial assignment mode, that's just fundamentally broken. Thanks,
The code isn't forcing a device to be assigned to the guest. It is forcing all the devices of the IOMMU group to be declared in the domain XML so that they are detached from the host.

What I did was to extend a verification Libvirt already does, which checks for PCI devices of the same IOMMU group X being used by other domains, to check the host as well. Guest start fails if there is any device left in IOMMU group X that's not present in the domain.

In short, the code is implying that all devices of the IOMMU group must be detached from the host, regardless of whether they're going to be used in the guest, and regardless of whether they are PCI root ports or bridges. Is this assumption correct, considering kernel/QEMU?

Thanks, DHB
Alex

On Mon, 16 Dec 2019 20:24:56 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
On 12/16/19 8:06 PM, Alex Williamson wrote:
On Mon, 16 Dec 2019 17:28:28 -0500 Cole Robinson <crobinso@redhat.com> wrote:
On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
changes from version 3 [1]:
Thanks for catching this! PCIe root ports or bridges being part of an IOMMU group is part of the nature of the grouping. However, only endpoint devices can be bound to vfio-pci and thus participate in this "partial assignment". If the code is trying to force all other devices in the IOMMU group that aren't assigned into this partial assignment mode, that's just fundamentally broken. Thanks,
The code isn't forcing a device to be assigned to the guest. It is forcing all the IOMMU devices to be declared in the domain XML to be detached from the host.
Detached from the host by unbinding from host drivers and binding to vfio-pci and "partially" assigned to the guest? That's wrong, we can't do that. Not only will vfio-pci not bind to anything but endpoints, you'll break the host binding bridges that might be part of the group, and there are valid use cases for sequestering a device with pci-stub rather than vfio-pci to add another barrier to the user getting access to the device.
What I did was to extend a verification Libvirt already does, to check for PCI devices of the same IOMMU X being used by other domains, to check the the host as well. Guest start fails if there is any device left in IOMMU X that's not present in the domain.
Yep, can't do that.
In short, the code is implying that all IOMMU devices must be detached from the host, regardless of whether they're going to be used in the guest, regardless of whether they are PCI root ports or bridges. Is this assumption correct, considering kernel/QEMU?
Nope, please don't do this. Thanks, Alex

On 12/16/19 8:43 PM, Alex Williamson wrote:
On Mon, 16 Dec 2019 20:24:56 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
The code isn't forcing a device to be assigned to the guest. It is forcing all the IOMMU devices to be declared in the domain XML to be detached from the host.
Detached from the host by unbinding from host drivers and binding to vfio-pci and "partially" assigned to the guest? That's wrong, we can't do that. Not only will vfio-pci not bind to anything but endpoints, you'll break the host binding bridges that might be part of the group, and there are valid use cases for sequestering a device with pci-stub rather than vfio-pci to add another barrier to the user getting access to the device.
What I did was to extend a verification Libvirt already does, to check for PCI devices of the same IOMMU X being used by other domains, to check the the host as well. Guest start fails if there is any device left in IOMMU X that's not present in the domain.
Yep, can't do that.
Thanks for the info.

To keep the discussion focused, this is the error I'm trying to dodge:

error: internal error: qemu unexpectedly closed the monitor: 2019-10-04T12:39:41.091312Z qemu-system-ppc64: -device vfio-pci,host=0001:09:00.3,id=hostdev0,bus=pci.2.0,addr=0x1.0x3: vfio 0001:09:00.3: group 1 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.

This happens when not all PCI devices from IOMMU group 1 are bound to vfio-pci, regardless of whether QEMU is going to use all of them in the guest. Binding all the IOMMU group devices to vfio-pci satisfies QEMU, in this particular case.

What is the minimal condition to avoid this error? What Libvirt is doing ATM is not enough (it will fail to launch with the QEMU error above), and what I'm proposing is wrong. Can we say that all PCI endpoints of the same IOMMU group must be bound to vfio-pci?

Thanks, DHB

On Mon, 16 Dec 2019 21:09:20 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
On 12/16/19 8:43 PM, Alex Williamson wrote:
On Mon, 16 Dec 2019 20:24:56 -0300 Daniel Henrique Barboza <danielhb413@gmail.com> wrote:
The code isn't forcing a device to be assigned to the guest. It is forcing all the IOMMU devices to be declared in the domain XML to be detached from the host.
Detached from the host by unbinding from host drivers and binding to vfio-pci and "partially" assigned to the guest? That's wrong, we can't do that. Not only will vfio-pci not bind to anything but endpoints, you'll break the host binding bridges that might be part of the group, and there are valid use cases for sequestering a device with pci-stub rather than vfio-pci to add another barrier to the user getting access to the device.
What I did was to extend a verification Libvirt already does, to check for PCI devices of the same IOMMU X being used by other domains, to check the the host as well. Guest start fails if there is any device left in IOMMU X that's not present in the domain.
Yep, can't do that.
Thanks for the info.
To keep the discussion focused, this is the error I'm trying to dodge:
error: internal error: qemu unexpectedly closed the monitor: 2019-10-04T12:39:41.091312Z qemu-system-ppc64: -device vfio-pci,host=0001:09:00.3,id=hostdev0,bus=pci.2.0,addr=0x1.0x3: vfio 0001:09:00.3: group 1 is not viable Please ensure all devices within the iommu_group are bound to their vfio bus driver.
This happens when not all PCI devices from IOMMU group 1 are bind to vfio_pci, regardless of whether QEMU is going to use all of them in the guest. Binding all the IOMMU devices to vfio-pci makes QEMU satisfied, in this particular case.
What is the minimal condition to avoid this error? What Libvirt is doing ATM is not enough (it will fail to launch with this QEMU error above), and what I'm proposing is wrong. Can we say that all PCI endpoints of the same IOMMU must be assigned to vfio-pci?
Yes, but libvirt should not assume that it can manipulate the bindings of adjacent devices without being explicitly directed to do so. The error may be a hindrance to you, but it might also prevent, for example, the only other NIC in the system being detached from the host driver. Is it worth making the VM run without explicitly listing all devices to assign, at the cost of disrupting host services or subverting the additional isolation a user might be attempting to configure by having unused devices bound to vfio-pci? This seems like a bad idea; the VM should be configured to explicitly list every device it needs to have assigned or partially assigned. Thanks, Alex

On 12/16/19 9:48 PM, Alex Williamson wrote:
On Mon, 16 Dec 2019 21:09:20 -0300
Yes, but libvirt should not assume that it can manipulate the bindings of adjacent devices without being explicitly directed to do so. The error may be a hindrance to you, but it might also prevent, for example, the only other NIC in the system being detached from the host driver. Is it worth making the VM run without explicitly listing all devices to assign at the cost of disrupting host services or subverting the additional isolation a user might be attempting to configure with having unused devices bound to vfio-pci. This seems like a bad idea, the VM should be configured to explicitly list every device it needs to have assigned or partially assigned. Thanks,
Thanks Alex. I know what's going wrong with this patch series after your messages and a close inspection of what Libvirt is already doing.

Libvirt does a sanity check for PCI endpoint devices before assigning them to vfio-pci, but the new detection code I am adding does not. The result is that Cole can't run his VM because this new detection code is demanding that a PCI bridge, which belongs to the same IOMMU group as the devices Cole wants to pass through, be assigned to vfio-pci. Which is wrong.

I'll re-send this series fixing that.

Thanks, DHB
Alex

On 12/16/19 7:28 PM, Cole Robinson wrote:
On 12/16/19 8:36 AM, Daniel Henrique Barboza wrote:
I've attached the nodedev XML for the three devices with iommuGroup 1. Only the two nvidia devices are assigned to my VM, but not the PCIe controller device.
Just noticed the attached file. This might indicate that there's a bug in patch 3 that wrongly assumes there are more IOMMU devices in group 1 than the ones you declared. I'll look into it. The error message could also be more helpful, declaring which PCI hostdev it thinks should be declared in the XML.
participants (4): Alex Williamson, Cole Robinson, Daniel Henrique Barboza, Laine Stump