[RFC PATCH 0/5] cover letter: qemu: Implement support for iommufd and multiple vSMMUs

Hi,

This is a follow up to the second RFC patchset [0] for supporting multiple
vSMMU instances and using iommufd to propagate DMA mappings to kernel for
VM-assigned host devices in a qemu VM.

This patchset implements support for specifying multiple <iommu> devices
within the VM definition when smmuv3Dev IOMMU model is specified, and is
tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]

Moreover, it adds a new 'iommufdId' attribute for hostdev devices to be
associated with the iommufd object.

For instance, specifying the iommufd object and associated hostdev in a
VM definition with multiple IOMMUs, configured to be routed to
pcie-expander-bus controllers in a way where VFIO device to SMMUv3
associations are matched with the host:

  <devices>
  ...
    <controller type='pci' index='1' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='252'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='248'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
  ...
    <controller type='pci' index='21' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='21' port='0x0'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='22' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='22' port='0xa8'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
  ...
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <iommufdId>iommufd0</iommufdId>
      <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <iommufdId>iommufd0</iommufdId>
      <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/>
    </hostdev>
    <iommu model='smmuv3Dev' parentIdx='1' accel='on'/>
    <iommu model='smmuv3Dev' parentIdx='2' accel='on'/>
  </devices>

This would get translated to a qemu command line with the arguments below.
Note that libvirt will open the /dev/iommu and VFIO cdev, passing the
associated fd number to qemu:

  -device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \
  -device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \
  -device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \
  -device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \
  -object '{"qom-type":"iommufd","id":"iommufd0","fd":"24"}' \
  -device '{"driver":"arm-smmuv3-accel","primary-bus":"pci.1","id":"smmuv3.0","accel":true}' \
  -device '{"driver":"arm-smmuv3-accel","primary-bus":"pci.2","id":"smmuv3.1","accel":true}' \
  -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","fd":"22","bus":"pci.21","addr":"0x0"}' \
  -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","fd":"25","bus":"pci.22","addr":"0x0"}' \

Summary of changes:
- Separated out commits for smmuv3Dev iommu model support and supporting
  multiple IOMMU definitions
- Made iommufd only a hostdev attribute
- Revised smmuv3Dev iommu model definition to reference the controller
  index instead of assigning it a BDF
- Open iommufd FDs from libvirt backend without exposing FDs to XML users
- Fixed iommufd path permissions
- Matched qemu usage of Shameer's latest RFCv3

This series is on Github:
https://github.com/NathanChenNVIDIA/libvirt/tree/smmuv3Dev-iommufd-08-12-25

Thanks,
Nathan

[0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/EASBQ...
[1] https://lore.kernel.org/qemu-devel/20250714155941.22176-1-shameerali.kolothu...

Signed-off-by: Nathan Chen <nathanc@nvidia.com>

Nathan Chen (5):
  qemu: add IOMMU model smmuv3Dev
  conf: Support multiple smmuv3Dev IOMMU devices
  qemu: Implement support for associating iommufd to hostdev
  qemu: open iommufd FDs from libvirt backend
  qemu: Update Cgroup, namespace, and seclabel for qemu to access
    iommufd paths

 docs/formatdomain.rst             |  22 ++-
 src/conf/domain_conf.c            | 208 ++++++++++++++++++++++--
 src/conf/domain_conf.h            |  13 +-
 src/conf/domain_validate.c        |  58 +++++--
 src/conf/schemas/domaincommon.rng |  24 ++-
 src/libvirt_private.syms          |   2 +
 src/qemu/qemu_alias.c             |  15 +-
 src/qemu/qemu_cgroup.c            |  61 +++++++
 src/qemu/qemu_cgroup.h            |   1 +
 src/qemu/qemu_command.c           | 261 ++++++++++++++++++++++--------
 src/qemu/qemu_command.h           |   3 +-
 src/qemu/qemu_domain.c            |   8 +
 src/qemu/qemu_domain.h            |   7 +
 src/qemu/qemu_domain_address.c    |  33 ++--
 src/qemu/qemu_driver.c            |   8 +-
 src/qemu/qemu_hotplug.c           |   2 +-
 src/qemu/qemu_namespace.c         |  44 +++++
 src/qemu/qemu_postparse.c         |  11 +-
 src/qemu/qemu_process.c           | 232 ++++++++++++++++++++++++++
 src/qemu/qemu_validate.c          |  18 ++-
 src/security/security_apparmor.c  |  11 ++
 src/security/security_dac.c       |  23 +++
 src/security/security_selinux.c   |  24 +++
 src/util/virpci.c                 |  68 ++++++++
 src/util/virpci.h                 |   1 +
 25 files changed, 1020 insertions(+), 138 deletions(-)

-- 
2.43.0
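
To make the fd hand-off described in the cover letter a bit more concrete, here is a minimal C sketch of how an already-opened descriptor number could be rendered into the -object and -device JSON shown in the command line above. The helper names (sketch_format_iommufd_object, sketch_format_vfio_device) are hypothetical and are not libvirt functions. Opening /dev/iommu and the VFIO cdev is deliberately left out, and plain snprintf() stands in for whatever JSON helpers the series actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not libvirt API): render the -object argument for an
 * iommufd backend whose descriptor the caller has already opened.
 * Returns 0 on success, -1 if formatting fails or the buffer is too small. */
static int
sketch_format_iommufd_object(char *buf, size_t buflen, const char *id, int fd)
{
    int n;

    assert(buf != NULL && id != NULL);

    /* The fd is emitted as a JSON string, matching the example command line. */
    n = snprintf(buf, buflen,
                 "{\"qom-type\":\"iommufd\",\"id\":\"%s\",\"fd\":\"%d\"}",
                 id, fd);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Hypothetical helper (not libvirt API): render the -device argument for a
 * vfio-pci hostdev that refers to the iommufd object by id and carries its
 * own pre-opened VFIO cdev descriptor.
 * Returns 0 on success, -1 if formatting fails or the buffer is too small. */
static int
sketch_format_vfio_device(char *buf, size_t buflen,
                          const char *host, const char *id,
                          const char *iommufd_id, int fd,
                          const char *bus, const char *addr)
{
    int n;

    assert(buf != NULL && host != NULL && id != NULL);
    assert(iommufd_id != NULL && bus != NULL && addr != NULL);

    n = snprintf(buf, buflen,
                 "{\"driver\":\"vfio-pci\",\"host\":\"%s\",\"id\":\"%s\","
                 "\"iommufd\":\"%s\",\"fd\":\"%d\",\"bus\":\"%s\","
                 "\"addr\":\"%s\"}",
                 host, id, iommufd_id, fd, bus, addr);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Both hostdevs in the example share one iommufd object ("iommufd0") while each carries its own VFIO cdev descriptor, which is why the sketch takes the iommufd id and the per-device fd as separate parameters.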

Introduce support for "smmuv3Dev" IOMMU model and "parentIdx" and "accel"
IOMMU device attributes. The former indicates the index of the controller
that a smmuv3Dev IOMMU device is attached to, while the latter indicates
whether hardware accelerated SMMUv3 support is to be enabled.

Signed-off-by: Nathan Chen <nathanc@nvidia.com>
---
 docs/formatdomain.rst             | 13 +++++-
 src/conf/domain_conf.c            | 35 +++++++++++++++
 src/conf/domain_conf.h            |  3 ++
 src/conf/domain_validate.c        | 26 +++++++++--
 src/conf/schemas/domaincommon.rng | 11 +++++
 src/qemu/qemu_command.c           | 73 +++++++++++++++++++++++++++++--
 src/qemu/qemu_domain_address.c    |  2 +
 src/qemu/qemu_validate.c          | 16 +++++++
 8 files changed, 170 insertions(+), 9 deletions(-)

diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst
index 976746e292..2558df18ef 100644
--- a/docs/formatdomain.rst
+++ b/docs/formatdomain.rst
@@ -9090,8 +9090,17 @@ Example:
 ``model``
    Supported values are ``intel`` (for Q35 guests) ``smmuv3``
    (:since:`since 5.5.0`, for ARM virt guests), ``virtio``
-   (:since:`since 8.3.0`, for Q35 and ARM virt guests) and
-   ``amd`` (:since:`since 11.5.0`).
+   (:since:`since 8.3.0`, for Q35 and ARM virt guests),
+   ``amd`` (:since:`since 11.5.0`), and ``smmuv3Dev`` (for
+   ARM virt guests).
+
+``parentIdx``
+   The ``parentIdx`` attribute notes the index of the controller that a
+   smmuv3Dev IOMMU device is attached to.
+
+``accel``
+   The ``accel`` attribute with possible values ``on`` and ``off`` can be used
+   to enable hardware acceleration support for smmuv3Dev IOMMU devices.
``driver`` The ``driver`` subelement can be used to configure additional options, some diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 59958c2f08..dc222887d4 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1353,6 +1353,7 @@ VIR_ENUM_IMPL(virDomainIOMMUModel, "smmuv3", "virtio", "amd", + "smmuv3Dev", ); VIR_ENUM_IMPL(virDomainVsockModel, @@ -2813,6 +2814,8 @@ virDomainIOMMUDefNew(void) iommu = g_new0(virDomainIOMMUDef, 1); + iommu->parent_idx = -1; + return g_steal_pointer(&iommu); } @@ -14362,6 +14365,14 @@ virDomainIOMMUDefParseXML(virDomainXMLOption *xmlopt, VIR_XML_PROP_REQUIRED, &iommu->model) < 0) return NULL; + if (virXMLPropInt(node, "parentIdx", 10, VIR_XML_PROP_NONE, + &iommu->parent_idx, -1) < 0) + return NULL; + + if (virXMLPropTristateSwitch(node, "accel", VIR_XML_PROP_NONE, + &iommu->accel) < 0) + return NULL; + if ((driver = virXPathNode("./driver", ctxt))) { if (virXMLPropTristateSwitch(driver, "intremap", VIR_XML_PROP_NONE, &iommu->intremap) < 0) @@ -22021,6 +22032,18 @@ virDomainIOMMUDefCheckABIStability(virDomainIOMMUDef *src, dst->aw_bits, src->aw_bits); return false; } + if (src->parent_idx != dst->parent_idx) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target domain IOMMU device parent_idx value '%1$d' does not match source '%2$d'"), + dst->parent_idx, src->parent_idx); + return false; + } + if (src->accel != dst->accel) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target domain IOMMU device accel value '%1$d' does not match source '%2$d'"), + dst->accel, src->accel); + return false; + } if (src->dma_translation != dst->dma_translation) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, _("Target domain IOMMU device dma translation '%1$s' does not match source '%2$s'"), @@ -28342,6 +28365,18 @@ virDomainIOMMUDefFormat(virBuffer *buf, virBufferAsprintf(&attrBuf, " model='%s'", virDomainIOMMUModelTypeToString(iommu->model)); + if (iommu->parent_idx >= 0 && iommu->model == 
VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV) { + virBufferAsprintf(&attrBuf, " parentIdx='%d'", + iommu->parent_idx); + } + + if (iommu->model == VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV) { + if (iommu->accel != VIR_TRISTATE_SWITCH_ABSENT) { + virBufferAsprintf(&attrBuf, " accel='%s'", + virTristateSwitchTypeToString(iommu->accel)); + } + } + virXMLFormatElement(buf, "iommu", &attrBuf, &childBuf); } diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 596d138973..f87c5bbe93 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3036,6 +3036,7 @@ typedef enum { VIR_DOMAIN_IOMMU_MODEL_SMMUV3, VIR_DOMAIN_IOMMU_MODEL_VIRTIO, VIR_DOMAIN_IOMMU_MODEL_AMD, + VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV, VIR_DOMAIN_IOMMU_MODEL_LAST } virDomainIOMMUModel; @@ -3047,10 +3048,12 @@ struct _virDomainIOMMUDef { virTristateSwitch eim; virTristateSwitch iotlb; unsigned int aw_bits; + int parent_idx; virDomainDeviceInfo info; virTristateSwitch dma_translation; virTristateSwitch xtsup; virTristateSwitch pt; + virTristateSwitch accel; }; typedef enum { diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index 40edecef83..f1b1b8cc55 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -3085,7 +3085,8 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) iommu->eim != VIR_TRISTATE_SWITCH_ABSENT || iommu->iotlb != VIR_TRISTATE_SWITCH_ABSENT || iommu->aw_bits != 0 || - iommu->dma_translation != VIR_TRISTATE_SWITCH_ABSENT) { + iommu->dma_translation != VIR_TRISTATE_SWITCH_ABSENT || + iommu->accel != VIR_TRISTATE_SWITCH_ABSENT) { virReportError(VIR_ERR_XML_ERROR, _("iommu model '%1$s' doesn't support additional attributes"), virDomainIOMMUModelTypeToString(iommu->model)); @@ -3097,7 +3098,8 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) if (iommu->caching_mode != VIR_TRISTATE_SWITCH_ABSENT || iommu->eim != VIR_TRISTATE_SWITCH_ABSENT || iommu->aw_bits != 0 || - iommu->dma_translation != VIR_TRISTATE_SWITCH_ABSENT) { + 
iommu->dma_translation != VIR_TRISTATE_SWITCH_ABSENT || + iommu->accel != VIR_TRISTATE_SWITCH_ABSENT) { virReportError(VIR_ERR_XML_ERROR, _("iommu model '%1$s' doesn't support some additional attributes"), virDomainIOMMUModelTypeToString(iommu->model)); @@ -3107,7 +3109,24 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) case VIR_DOMAIN_IOMMU_MODEL_INTEL: if (iommu->pt != VIR_TRISTATE_SWITCH_ABSENT || - iommu->xtsup != VIR_TRISTATE_SWITCH_ABSENT) { + iommu->xtsup != VIR_TRISTATE_SWITCH_ABSENT || + iommu->accel != VIR_TRISTATE_SWITCH_ABSENT) { + virReportError(VIR_ERR_XML_ERROR, + _("iommu model '%1$s' doesn't support some additional attributes"), + virDomainIOMMUModelTypeToString(iommu->model)); + return -1; + } + break; + + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + if (iommu->intremap != VIR_TRISTATE_SWITCH_ABSENT || + iommu->caching_mode != VIR_TRISTATE_SWITCH_ABSENT || + iommu->eim != VIR_TRISTATE_SWITCH_ABSENT || + iommu->iotlb != VIR_TRISTATE_SWITCH_ABSENT || + iommu->aw_bits != 0 || + iommu->dma_translation != VIR_TRISTATE_SWITCH_ABSENT || + iommu->xtsup != VIR_TRISTATE_SWITCH_ABSENT || + iommu->pt != VIR_TRISTATE_SWITCH_ABSENT) { virReportError(VIR_ERR_XML_ERROR, _("iommu model '%1$s' doesn't support some additional attributes"), virDomainIOMMUModelTypeToString(iommu->model)); @@ -3132,6 +3151,7 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: case VIR_DOMAIN_IOMMU_MODEL_AMD: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: case VIR_DOMAIN_IOMMU_MODEL_LAST: break; } diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index a714c3fcc5..0e57d2a9b9 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6246,8 +6246,19 @@ <value>smmuv3</value> <value>virtio</value> <value>amd</value> + <value>smmuv3Dev</value> </choice> </attribute> + <optional> + <attribute name="parentIdx"> + <data type="unsignedInt"/> + </attribute> + </optional> + 
<optional> + <attribute name="accel"> + <ref name="virOnOff"/> + </attribute> + </optional> <interleave> <optional> <element name="driver"> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 457dee7029..8a124a495b 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6250,6 +6250,63 @@ qemuBuildBootCommandLine(virCommand *cmd, } +static virJSONValue * +qemuBuildPCISmmuv3DevDevProps(const virDomainDef *def, + const virDomainIOMMUDef *iommu) +{ + g_autoptr(virJSONValue) props = NULL; + g_autofree char *bus = NULL; + size_t i; + bool contIsPHB = false; + + for (i = 0; i < def->ncontrollers; i++) { + virDomainControllerDef *cont = def->controllers[i]; + if (cont->idx == iommu->parent_idx) { + if (cont->type == VIR_DOMAIN_CONTROLLER_TYPE_PCI) { + const char *alias = cont->info.alias; + contIsPHB = virDomainControllerIsPSeriesPHB(cont); + + if (!alias) + return NULL; + + if (virDomainDeviceAliasIsUserAlias(alias)) { + if (cont->model == VIR_DOMAIN_CONTROLLER_MODEL_PCI_ROOT && + iommu->parent_idx <= 0) { + if (qemuDomainSupportsPCIMultibus(def)) + bus = g_strdup("pci.0"); + else + bus = g_strdup("pci"); + } else if (cont->model == VIR_DOMAIN_CONTROLLER_MODEL_PCIE_ROOT) { + bus = g_strdup("pcie.0"); + } + } else { + bus = g_strdup(alias); + } + break; + } + } + } + + if (!bus) + return NULL; + + if (contIsPHB && iommu->parent_idx > 0) { + char *temp_bus = g_strdup_printf("%s.0", bus); + g_free(bus); + bus = temp_bus; + } + + if (virJSONValueObjectAdd(&props, + "s:driver", "arm-smmuv3", + "s:primary-bus", bus, + "b:accel", (iommu->accel == VIR_TRISTATE_SWITCH_ON), + NULL) < 0) + return NULL; + + return g_steal_pointer(&props); +} + + static int qemuBuildIOMMUCommandLine(virCommand *cmd, const virDomainDef *def, @@ -6298,7 +6355,6 @@ qemuBuildIOMMUCommandLine(virCommand *cmd, return 0; case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: - /* There is no -device for SMMUv3, so nothing to be done here */ return 0; case VIR_DOMAIN_IOMMU_MODEL_AMD: @@ -6329,6 
+6385,14 @@ qemuBuildIOMMUCommandLine(virCommand *cmd, return 0; + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + if (!(props = qemuBuildPCISmmuv3DevDevProps(def, iommu))) + return -1; + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; + + return 0; + case VIR_DOMAIN_IOMMU_MODEL_LAST: default: virReportEnumRangeError(virDomainIOMMUModel, iommu->model); @@ -7162,6 +7226,7 @@ qemuBuildMachineCommandLine(virCommand *cmd, case VIR_DOMAIN_IOMMU_MODEL_INTEL: case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: case VIR_DOMAIN_IOMMU_MODEL_AMD: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: /* These IOMMUs are formatted in qemuBuildIOMMUCommandLine */ break; @@ -10807,15 +10872,15 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildBootCommandLine(cmd, def) < 0) return NULL; - if (qemuBuildIOMMUCommandLine(cmd, def, qemuCaps) < 0) - return NULL; - if (qemuBuildGlobalControllerCommandLine(cmd, def) < 0) return NULL; if (qemuBuildControllersCommandLine(cmd, def, qemuCaps) < 0) return NULL; + if (qemuBuildIOMMUCommandLine(cmd, def, qemuCaps) < 0) + return NULL; + if (qemuBuildMemoryDeviceCommandLine(cmd, cfg, def, priv) < 0) return NULL; diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 96a9ca9b14..06bf4fab32 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -952,6 +952,7 @@ qemuDomainDeviceCalculatePCIConnectFlags(virDomainDeviceDef *dev, case VIR_DOMAIN_IOMMU_MODEL_INTEL: case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: case VIR_DOMAIN_IOMMU_MODEL_LAST: /* These are not PCI devices */ return 0; @@ -2378,6 +2379,7 @@ qemuDomainAssignDevicePCISlots(virDomainDef *def, case VIR_DOMAIN_IOMMU_MODEL_INTEL: case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: case VIR_DOMAIN_IOMMU_MODEL_LAST: /* These are not PCI devices */ break; diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c index adba3e4a89..163d7758b8 100644 --- 
a/src/qemu/qemu_validate.c +++ b/src/qemu/qemu_validate.c @@ -5406,6 +5406,22 @@ qemuValidateDomainDeviceDefIOMMU(const virDomainIOMMUDef *iommu, } break; + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + if (!qemuDomainIsARMVirt(def)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("IOMMU device: '%1$s' is only supported with ARM Virt machines"), + virDomainIOMMUModelTypeToString(iommu->model)); + return -1; + } + // TODO: Check for pluggable device SMMUv3 qemu capability + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_MACHINE_VIRT_IOMMU)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("IOMMU device: '%1$s' is not supported with this QEMU binary"), + virDomainIOMMUModelTypeToString(iommu->model)); + return -1; + } + break; + case VIR_DOMAIN_IOMMU_MODEL_LAST: default: virReportEnumRangeError(virDomainIOMMUModel, iommu->model); -- 2.43.0
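
The core of qemuBuildPCISmmuv3DevDevProps() in the patch above is a lookup from parentIdx to the alias of the matching PCI controller, which then becomes the "primary-bus" property. The following standalone C sketch shows only that lookup plus the accel mapping. All types and names in it (sketch_controller, sketch_iommu, sketch_find_primary_bus, sketch_accel_enabled) are hypothetical stand-ins for the libvirt structures, and the user-alias fallback and pSeries PHB suffix handling from the patch are intentionally omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for virTristateSwitch. */
enum sketch_tristate { SKETCH_ABSENT = 0, SKETCH_ON, SKETCH_OFF };

/* Hypothetical stand-in for the few controller fields the lookup needs. */
struct sketch_controller {
    int idx;            /* controller index from the domain XML */
    int is_pci;         /* non-zero for type='pci' controllers */
    const char *alias;  /* e.g. "pci.1"; NULL if no alias is assigned yet */
};

/* Hypothetical stand-in for the new smmuv3Dev fields of the IOMMU def. */
struct sketch_iommu {
    int parent_idx;             /* -1 when parentIdx was not given */
    enum sketch_tristate accel; /* accel='on' / 'off' / absent */
};

/* Return the alias of the PCI controller whose index equals parent_idx,
 * or NULL when there is no such controller or it has no alias. Non-PCI
 * controllers that happen to share the index are skipped. */
static const char *
sketch_find_primary_bus(const struct sketch_controller *conts, size_t nconts,
                        const struct sketch_iommu *iommu)
{
    size_t i;

    assert(conts != NULL && iommu != NULL);

    for (i = 0; i < nconts; i++) {
        if (conts[i].is_pci && conts[i].idx == iommu->parent_idx)
            return conts[i].alias;
    }
    return NULL;
}

/* Only an explicit accel='on' enables acceleration; absent behaves like
 * 'off', mirroring the (iommu->accel == VIR_TRISTATE_SWITCH_ON) test. */
static int
sketch_accel_enabled(const struct sketch_iommu *iommu)
{
    assert(iommu != NULL);
    return iommu->accel == SKETCH_ON;
}
```

With the controllers from the cover letter example, parentIdx='1' would resolve to "pci.1" and parentIdx='2' to "pci.2". An IOMMU left at the default parent_idx of -1 resolves to no bus, which the patch treats as a failure to build the device properties.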

Add support for parsing multiple IOMMU devices from the VM definition when "smmuv3Dev" is the IOMMU model. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.c | 153 ++++++++++++++++++++++++++---- src/conf/domain_conf.h | 9 +- src/conf/domain_validate.c | 32 ++++--- src/conf/schemas/domaincommon.rng | 4 +- src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 15 ++- src/qemu/qemu_command.c | 146 ++++++++++++++-------------- src/qemu/qemu_domain_address.c | 35 +++---- src/qemu/qemu_driver.c | 8 +- src/qemu/qemu_postparse.c | 11 ++- src/qemu/qemu_validate.c | 2 +- 11 files changed, 284 insertions(+), 133 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index dc222887d4..5ea4d6424b 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -4132,7 +4132,8 @@ void virDomainDefFree(virDomainDef *def) virDomainCryptoDefFree(def->cryptos[i]); g_free(def->cryptos); - virDomainIOMMUDefFree(def->iommu); + for (i = 0; i < def->niommus; i++) + virDomainIOMMUDefFree(def->iommu[i]); virDomainPstoreDefFree(def->pstore); @@ -5004,9 +5005,9 @@ virDomainDeviceInfoIterateFlags(virDomainDef *def, } device.type = VIR_DOMAIN_DEVICE_IOMMU; - if (def->iommu) { - device.data.iommu = def->iommu; - if ((rc = cb(def, &device, &def->iommu->info, opaque)) != 0) + for (i = 0; i < def->niommus; i++) { + device.data.iommu = def->iommu[i]; + if ((rc = cb(def, &device, &def->iommu[i]->info, opaque)) != 0) return rc; } @@ -16446,6 +16447,112 @@ virDomainInputDefFind(const virDomainDef *def, } +bool +virDomainIOMMUDefEquals(const virDomainIOMMUDef *a, + const virDomainIOMMUDef *b) +{ + if (a->model != b->model || + a->intremap != b->intremap || + a->caching_mode != b->caching_mode || + a->eim != b->eim || + a->iotlb != b->iotlb || + a->aw_bits != b->aw_bits || + a->parent_idx != b->parent_idx || + a->accel != b->accel || + a->dma_translation != b->dma_translation) + return false; + + switch (a->info.type) { + case 
VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI: + if (a->info.addr.pci.domain != b->info.addr.pci.domain || + a->info.addr.pci.bus != b->info.addr.pci.bus || + a->info.addr.pci.slot != b->info.addr.pci.slot || + a->info.addr.pci.function != b->info.addr.pci.function) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DRIVE: + if (a->info.addr.drive.controller != b->info.addr.drive.controller || + a->info.addr.drive.bus != b->info.addr.drive.bus || + a->info.addr.drive.unit != b->info.addr.drive.unit) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_SERIAL: + if (a->info.addr.vioserial.controller != b->info.addr.vioserial.controller || + a->info.addr.vioserial.bus != b->info.addr.vioserial.bus || + a->info.addr.vioserial.port != b->info.addr.vioserial.port) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCID: + if (a->info.addr.ccid.controller != b->info.addr.ccid.controller || + a->info.addr.ccid.slot != b->info.addr.ccid.slot) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA: + if (a->info.addr.isa.iobase != b->info.addr.isa.iobase || + a->info.addr.isa.irq != b->info.addr.isa.irq) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM: + if (a->info.addr.dimm.slot != b->info.addr.dimm.slot) { + return false; + } + + if (a->info.addr.dimm.base != b->info.addr.dimm.base) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCW: + if (a->info.addr.ccw.cssid != b->info.addr.ccw.cssid || + a->info.addr.ccw.ssid != b->info.addr.ccw.ssid || + a->info.addr.ccw.devno != b->info.addr.ccw.devno) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_USB: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_SPAPRVIO: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: + case 
VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST: + break; + } + + if (a->info.acpiIndex != b->info.acpiIndex) { + return false; + } + + return true; +} + + +ssize_t +virDomainIOMMUDefFind(const virDomainDef *def, + const virDomainIOMMUDef *iommu) +{ + size_t i; + + for (i = 0; i < def->niommus; i++) { + if (virDomainIOMMUDefEquals(iommu, def->iommu[i])) + return i; + } + + return -1; +} + + bool virDomainVsockDefEquals(const virDomainVsockDef *a, const virDomainVsockDef *b) @@ -20098,19 +20205,28 @@ virDomainDefParseXML(xmlXPathContextPtr ctxt, } VIR_FREE(nodes); + /* analysis of iommu devices */ if ((n = virXPathNodeSet("./devices/iommu", ctxt, &nodes)) < 0) return NULL; - if (n > 1) { + if (n > 1 && !virXPathBoolean("./devices/iommu/@model = 'smmuv3Dev'", ctxt)) { virReportError(VIR_ERR_XML_ERROR, "%s", - _("only a single IOMMU device is supported")); + _("multiple IOMMU devices are only supported with model smmuv3Dev")); return NULL; } - if (n > 0) { - if (!(def->iommu = virDomainIOMMUDefParseXML(xmlopt, nodes[0], - ctxt, flags))) + if (n > 0) + def->iommu = g_new0(virDomainIOMMUDef *, n); + + for (i = 0; i < n; i++) { + virDomainIOMMUDef *iommu; + + iommu = virDomainIOMMUDefParseXML(xmlopt, nodes[i], ctxt, flags); + + if (!iommu) return NULL; + + def->iommu[def->niommus++] = iommu; } VIR_FREE(nodes); @@ -22558,15 +22674,17 @@ virDomainDefCheckABIStabilityFlags(virDomainDef *src, goto error; } - if (!!src->iommu != !!dst->iommu) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("Target domain IOMMU device count does not match source")); + if (src->niommus != dst->niommus) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target domain IOMMU device count %1$zu does not match source %2$zu"), + dst->niommus, src->niommus); goto error; } - if (src->iommu && - !virDomainIOMMUDefCheckABIStability(src->iommu, dst->iommu)) - goto error; + for (i = 0; i < src->niommus; i++) { + if (!virDomainIOMMUDefCheckABIStability(src->iommu[i], dst->iommu[i])) + goto error; + } if 
(!!src->vsock != !!dst->vsock) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", @@ -29402,8 +29520,9 @@ virDomainDefFormatInternalSetRootName(virDomainDef *def, for (n = 0; n < def->ncryptos; n++) { virDomainCryptoDefFormat(buf, def->cryptos[n], flags); } - if (def->iommu) - virDomainIOMMUDefFormat(buf, def->iommu); + + for (n = 0; n < def->niommus; n++) + virDomainIOMMUDefFormat(buf, def->iommu[n]); if (def->vsock) virDomainVsockDefFormat(buf, def->vsock); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index f87c5bbe93..edb18632f3 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3294,6 +3294,9 @@ struct _virDomainDef { size_t nwatchdogs; virDomainWatchdogDef **watchdogs; + size_t niommus; + virDomainIOMMUDef **iommu; + /* At maximum 2 TPMs on the domain if a TPM Proxy is present. */ size_t ntpms; virDomainTPMDef **tpms; @@ -3303,7 +3306,6 @@ struct _virDomainDef { virDomainNVRAMDef *nvram; virCPUDef *cpu; virDomainRedirFilterDef *redirfilter; - virDomainIOMMUDef *iommu; virDomainVsockDef *vsock; virDomainPstoreDef *pstore; @@ -4308,6 +4310,11 @@ virDomainShmemDef *virDomainShmemDefRemove(virDomainDef *def, size_t idx) ssize_t virDomainInputDefFind(const virDomainDef *def, const virDomainInputDef *input) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; +bool virDomainIOMMUDefEquals(const virDomainIOMMUDef *a, + const virDomainIOMMUDef *b); +ssize_t virDomainIOMMUDefFind(const virDomainDef *def, + const virDomainIOMMUDef *iommu) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; bool virDomainVsockDefEquals(const virDomainVsockDef *a, const virDomainVsockDef *b) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index f1b1b8cc55..b2f94b921f 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -1840,21 +1840,31 @@ virDomainDefCputuneValidate(const virDomainDef *def) static int 
virDomainDefIOMMUValidate(const virDomainDef *def) { + size_t i; + if (!def->iommu) return 0; - if (def->iommu->intremap == VIR_TRISTATE_SWITCH_ON && - def->features[VIR_DOMAIN_FEATURE_IOAPIC] != VIR_DOMAIN_IOAPIC_QEMU) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("IOMMU interrupt remapping requires split I/O APIC (ioapic driver='qemu')")); - return -1; - } + for (i = 0; i < def->niommus; i++) { + virDomainIOMMUDef *iommu = def->iommu[i]; + if (def->niommus > 1 && iommu->model != VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("IOMMU model smmuv3Dev must be specified for multiple IOMMU definitions")); + } - if (def->iommu->eim == VIR_TRISTATE_SWITCH_ON && - def->iommu->intremap != VIR_TRISTATE_SWITCH_ON) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("IOMMU eim requires interrupt remapping to be enabled")); - return -1; + if (iommu->intremap == VIR_TRISTATE_SWITCH_ON && + def->features[VIR_DOMAIN_FEATURE_IOAPIC] != VIR_DOMAIN_IOAPIC_QEMU) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU interrupt remapping requires split I/O APIC (ioapic driver='qemu')")); + return -1; + } + + if (iommu->eim == VIR_TRISTATE_SWITCH_ON && + iommu->intremap != VIR_TRISTATE_SWITCH_ON) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU eim requires interrupt remapping to be enabled")); + return -1; + } } return 0; diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index 0e57d2a9b9..fd19f115f7 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6944,9 +6944,9 @@ <zeroOrMore> <ref name="panic"/> </zeroOrMore> - <optional> + <zeroOrMore> <ref name="iommu"/> - </optional> + </zeroOrMore> <optional> <ref name="vsock"/> </optional> diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index b846011f0f..924cfa1db7 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -491,6 +491,8 @@ 
virDomainInputSourceGrabToggleTypeToString; virDomainInputSourceGrabTypeFromString; virDomainInputSourceGrabTypeToString; virDomainInputTypeToString; +virDomainIOMMUDefEquals; +virDomainIOMMUDefFind; virDomainIOMMUDefFree; virDomainIOMMUDefNew; virDomainIOMMUModelTypeFromString; diff --git a/src/qemu/qemu_alias.c b/src/qemu/qemu_alias.c index a27c688d79..5f2b11b9a6 100644 --- a/src/qemu/qemu_alias.c +++ b/src/qemu/qemu_alias.c @@ -647,10 +647,14 @@ qemuAssignDeviceVsockAlias(virDomainVsockDef *vsock) static void -qemuAssignDeviceIOMMUAlias(virDomainIOMMUDef *iommu) +qemuAssignDeviceIOMMUAlias(virDomainDef *def, + virDomainIOMMUDef **iommu) { - if (!iommu->info.alias) - iommu->info.alias = g_strdup("iommu0"); + size_t i; + for (i = 0; i < def->niommus; i++) { + if (!iommu[i]->info.alias) + iommu[i]->info.alias = g_strdup_printf("iommu%zu", i); + } } @@ -766,8 +770,9 @@ qemuAssignDeviceAliases(virDomainDef *def) if (def->vsock) { qemuAssignDeviceVsockAlias(def->vsock); } - if (def->iommu) - qemuAssignDeviceIOMMUAlias(def->iommu); + if (def->iommu && def->niommus > 0) { + qemuAssignDeviceIOMMUAlias(def, def->iommu); + } for (i = 0; i < def->ncryptos; i++) { qemuAssignDeviceCryptoAlias(def, def->cryptos[i]); } diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 8a124a495b..cecd0661ca 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6252,10 +6252,12 @@ qemuBuildBootCommandLine(virCommand *cmd, static virJSONValue * qemuBuildPCISmmuv3DevDevProps(const virDomainDef *def, - const virDomainIOMMUDef *iommu) + const virDomainIOMMUDef *iommu, + size_t id) { g_autoptr(virJSONValue) props = NULL; g_autofree char *bus = NULL; + g_autofree char *smmuv3_id = NULL; size_t i; bool contIsPHB = false; @@ -6296,9 +6298,12 @@ qemuBuildPCISmmuv3DevDevProps(const virDomainDef *def, bus = temp_bus; } + smmuv3_id = g_strdup_printf("smmuv3.%zu", id); + if (virJSONValueObjectAdd(&props, "s:driver", "arm-smmuv3", "s:primary-bus", bus, + "s:id", smmuv3_id, 
"b:accel", (iommu->accel == VIR_TRISTATE_SWITCH_ON), NULL) < 0) return NULL; @@ -6312,91 +6317,92 @@ qemuBuildIOMMUCommandLine(virCommand *cmd, const virDomainDef *def, virQEMUCaps *qemuCaps) { + size_t i; g_autoptr(virJSONValue) props = NULL; g_autoptr(virJSONValue) wrapperProps = NULL; - const virDomainIOMMUDef *iommu = def->iommu; - - if (!iommu) + if (!def->iommu || def->niommus <= 0) return 0; - switch (iommu->model) { - case VIR_DOMAIN_IOMMU_MODEL_INTEL: - if (virJSONValueObjectAdd(&props, - "s:driver", "intel-iommu", - "s:id", iommu->info.alias, - "S:intremap", qemuOnOffAuto(iommu->intremap), - "T:caching-mode", iommu->caching_mode, - "S:eim", qemuOnOffAuto(iommu->eim), - "T:device-iotlb", iommu->iotlb, - "z:aw-bits", iommu->aw_bits, - "T:dma-translation", iommu->dma_translation, - NULL) < 0) - return -1; + for (i = 0; i < def->niommus; i++) { + virDomainIOMMUDef *iommu = def->iommu[i]; + switch (iommu->model) { + case VIR_DOMAIN_IOMMU_MODEL_INTEL: + if (virJSONValueObjectAdd(&props, + "s:driver", "intel-iommu", + "s:id", iommu->info.alias, + "S:intremap", qemuOnOffAuto(iommu->intremap), + "T:caching-mode", iommu->caching_mode, + "S:eim", qemuOnOffAuto(iommu->eim), + "T:device-iotlb", iommu->iotlb, + "z:aw-bits", iommu->aw_bits, + "T:dma-translation", iommu->dma_translation, + NULL) < 0) + return -1; - if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) - return -1; + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; - return 0; + return 0; + case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: + if (virJSONValueObjectAdd(&props, + "s:driver", "virtio-iommu", + "s:id", iommu->info.alias, + NULL) < 0) { + return -1; + } - case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: - if (virJSONValueObjectAdd(&props, - "s:driver", "virtio-iommu", - "s:id", iommu->info.alias, - NULL) < 0) { - return -1; - } + if (qemuBuildDeviceAddressProps(props, def, &iommu->info) < 0) + return -1; - if (qemuBuildDeviceAddressProps(props, def, &iommu->info) < 0) - 
return -1; + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; - if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) - return -1; + return 0; + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + /* There is no -device for SMMUv3, so nothing to be done here */ + return 0; - return 0; + case VIR_DOMAIN_IOMMU_MODEL_AMD: + if (virJSONValueObjectAdd(&wrapperProps, + "s:driver", "AMDVI-PCI", + "s:id", iommu->info.alias, + NULL) < 0) + return -1; - case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: - return 0; + if (qemuBuildDeviceAddressProps(wrapperProps, def, &iommu->info) < 0) + return -1; - case VIR_DOMAIN_IOMMU_MODEL_AMD: - if (virJSONValueObjectAdd(&wrapperProps, - "s:driver", "AMDVI-PCI", - "s:id", iommu->info.alias, - NULL) < 0) - return -1; + if (qemuBuildDeviceCommandlineFromJSON(cmd, wrapperProps, def, qemuCaps) < 0) + return -1; - if (qemuBuildDeviceAddressProps(wrapperProps, def, &iommu->info) < 0) - return -1; + if (virJSONValueObjectAdd(&props, + "s:driver", "amd-iommu", + "s:pci-id", iommu->info.alias, + "S:intremap", qemuOnOffAuto(iommu->intremap), + "T:pt", iommu->pt, + "T:xtsup", iommu->xtsup, + "T:device-iotlb", iommu->iotlb, + NULL) < 0) + return -1; - if (qemuBuildDeviceCommandlineFromJSON(cmd, wrapperProps, def, qemuCaps) < 0) - return -1; + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; - if (virJSONValueObjectAdd(&props, - "s:driver", "amd-iommu", - "s:pci-id", iommu->info.alias, - "S:intremap", qemuOnOffAuto(iommu->intremap), - "T:pt", iommu->pt, - "T:xtsup", iommu->xtsup, - "T:device-iotlb", iommu->iotlb, - NULL) < 0) - return -1; + return 0; - if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) - return -1; + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + if (!(props = qemuBuildPCISmmuv3DevDevProps(def, iommu, i))) + return -1; + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; + break; - return 0; - case 
VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: - if (!(props = qemuBuildPCISmmuv3DevDevProps(def, iommu))) - return -1; - if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + case VIR_DOMAIN_IOMMU_MODEL_LAST: + default: + virReportEnumRangeError(virDomainIOMMUModel, iommu->model); return -1; - - return 0; - - case VIR_DOMAIN_IOMMU_MODEL_LAST: - default: - virReportEnumRangeError(virDomainIOMMUModel, iommu->model); - return -1; + } } return 0; @@ -7217,8 +7223,8 @@ qemuBuildMachineCommandLine(virCommand *cmd, if (qemuAppendDomainFeaturesMachineParam(&buf, def, qemuCaps) < 0) return -1; - if (def->iommu) { - switch (def->iommu->model) { + if (def->iommu && def->niommus == 1) { + switch (def->iommu[0]->model) { case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: virBufferAddLit(&buf, ",iommu=smmuv3"); break; @@ -7232,7 +7238,7 @@ qemuBuildMachineCommandLine(virCommand *cmd, case VIR_DOMAIN_IOMMU_MODEL_LAST: default: - virReportEnumRangeError(virDomainIOMMUModel, def->iommu->model); + virReportEnumRangeError(virDomainIOMMUModel, def->iommu[0]->model); return -1; } } diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 06bf4fab32..2ddc629304 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -2365,24 +2365,25 @@ qemuDomainAssignDevicePCISlots(virDomainDef *def, /* Nada - none are PCI based (yet) */ } - if (def->iommu) { - virDomainIOMMUDef *iommu = def->iommu; - - switch (iommu->model) { - case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: - case VIR_DOMAIN_IOMMU_MODEL_AMD: - if (virDeviceInfoPCIAddressIsWanted(&iommu->info) && - qemuDomainPCIAddressReserveNextAddr(addrs, &iommu->info) < 0) { - return -1; - } - break; + if (def->iommu && def->niommus > 0) { + for (i = 0; i < def->niommus; i++) { + virDomainIOMMUDef *iommu = def->iommu[i]; + switch (iommu->model) { + case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: + case VIR_DOMAIN_IOMMU_MODEL_AMD: + if (virDeviceInfoPCIAddressIsWanted(&iommu->info) && + 
qemuDomainPCIAddressReserveNextAddr(addrs, &iommu->info) < 0) { + return -1; + } + break; - case VIR_DOMAIN_IOMMU_MODEL_INTEL: - case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: - case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: - case VIR_DOMAIN_IOMMU_MODEL_LAST: - /* These are not PCI devices */ - break; + case VIR_DOMAIN_IOMMU_MODEL_INTEL: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + case VIR_DOMAIN_IOMMU_MODEL_LAST: + /* These are not PCI devices */ + break; + } } } diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index ac72ea5cb0..3d65f78c9e 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -6894,12 +6894,12 @@ qemuDomainAttachDeviceConfig(virDomainDef *vmdef, break; case VIR_DOMAIN_DEVICE_IOMMU: - if (vmdef->iommu) { + if (vmdef->iommu && vmdef->niommus > 0) { virReportError(VIR_ERR_OPERATION_INVALID, "%s", _("domain already has an iommu device")); return -1; } - vmdef->iommu = g_steal_pointer(&dev->data.iommu); + VIR_APPEND_ELEMENT(vmdef->iommu, vmdef->niommus, dev->data.iommu); break; case VIR_DOMAIN_DEVICE_VIDEO: @@ -7113,12 +7113,12 @@ qemuDomainDetachDeviceConfig(virDomainDef *vmdef, break; case VIR_DOMAIN_DEVICE_IOMMU: - if (!vmdef->iommu) { + if ((idx = virDomainIOMMUDefFind(vmdef, dev->data.iommu)) < 0) { virReportError(VIR_ERR_OPERATION_FAILED, "%s", _("matching iommu device not found")); return -1; } - g_clear_pointer(&vmdef->iommu, virDomainIOMMUDefFree); + VIR_DELETE_ELEMENT(vmdef->iommu, idx, vmdef->niommus); break; case VIR_DOMAIN_DEVICE_VIDEO: diff --git a/src/qemu/qemu_postparse.c b/src/qemu/qemu_postparse.c index 9c2427970d..e2744a4a61 100644 --- a/src/qemu/qemu_postparse.c +++ b/src/qemu/qemu_postparse.c @@ -1503,7 +1503,7 @@ qemuDomainDefAddDefaultDevices(virQEMUDriver *driver, } } - if (addIOMMU && !def->iommu && + if (addIOMMU && !def->iommu && def->niommus == 0 && virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE_INTEL_IOMMU) && virQEMUCapsGet(qemuCaps, QEMU_CAPS_INTEL_IOMMU_INTREMAP) && 
virQEMUCapsGet(qemuCaps, QEMU_CAPS_INTEL_IOMMU_EIM)) { @@ -1515,7 +1515,8 @@ qemuDomainDefAddDefaultDevices(virQEMUDriver *driver, iommu->intremap = VIR_TRISTATE_SWITCH_ON; iommu->eim = VIR_TRISTATE_SWITCH_ON; - def->iommu = g_steal_pointer(&iommu); + def->iommu = g_new0(virDomainIOMMUDef *, 1); + def->iommu[def->niommus++] = g_steal_pointer(&iommu); } if (qemuDomainDefAddDefaultAudioBackend(driver, def) < 0) @@ -1591,9 +1592,9 @@ qemuDomainDefEnableDefaultFeatures(virDomainDef *def, * domain already has IOMMU without inremap. This will be fixed in * qemuDomainIOMMUDefPostParse() but there domain definition can't be * modified so change it now. */ - if (def->iommu && - (def->iommu->intremap == VIR_TRISTATE_SWITCH_ON || - qemuDomainNeedsIOMMUWithEIM(def)) && + if (def->iommu && def->niommus == 1 && + (def->iommu[0]->intremap == VIR_TRISTATE_SWITCH_ON || + qemuDomainNeedsIOMMUWithEIM(def)) && def->features[VIR_DOMAIN_FEATURE_IOAPIC] == VIR_DOMAIN_IOAPIC_NONE) { def->features[VIR_DOMAIN_FEATURE_IOAPIC] = VIR_DOMAIN_IOAPIC_QEMU; } diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c index 163d7758b8..1b5b1deb5d 100644 --- a/src/qemu/qemu_validate.c +++ b/src/qemu/qemu_validate.c @@ -851,7 +851,7 @@ qemuValidateDomainVCpuTopology(const virDomainDef *def, virQEMUCaps *qemuCaps) QEMU_MAX_VCPUS_WITHOUT_EIM); return -1; } - if (!def->iommu || def->iommu->eim != VIR_TRISTATE_SWITCH_ON) { + if (!def->iommu || def->iommu[0]->eim != VIR_TRISTATE_SWITCH_ON) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, _("more than %1$d vCPUs require extended interrupt mode enabled on the iommu device"), QEMU_MAX_VCPUS_WITHOUT_EIM); -- 2.43.0
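
For readers following the detach path above, here is a minimal standalone sketch of the "find the matching entry, then delete it from the array" pattern that the patch expresses with virDomainIOMMUDefFind() and VIR_DELETE_ELEMENT(). The IommuDef struct and both helper names are hypothetical stand-ins, not libvirt code, and only two fields are compared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for virDomainIOMMUDef; only two fields are modelled. */
typedef struct {
    int model;
    int parent_idx;
} IommuDef;

/* Return the index of the first entry matching 'needle', or -1 if absent. */
static int
iommu_find(IommuDef *const *list, size_t n, const IommuDef *needle)
{
    size_t i;

    assert(needle != NULL);
    for (i = 0; i < n; i++) {
        if (list[i]->model == needle->model &&
            list[i]->parent_idx == needle->parent_idx)
            return (int)i;
    }
    return -1;
}

/* Remove entry 'idx' by shifting the tail down one slot; returns the new count. */
static size_t
iommu_delete(IommuDef **list, size_t n, size_t idx)
{
    assert(idx < n);
    memmove(&list[idx], &list[idx + 1], (n - idx - 1) * sizeof(*list));
    return n - 1;
}
```

The sketch does not free the removed element; in the real code, ownership of the deleted definition would still need to be handled by the caller.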

On Thu, Aug 14, 2025 at 07:54:11PM -0700, Nathan Chen via Devel wrote:
Add support for parsing multiple IOMMU devices from the VM definition when "smmuv3Dev" is the IOMMU model.
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
---
 src/conf/domain_conf.c            | 153 ++++++++++++++++++++++++++----
 src/conf/domain_conf.h            |   9 +-
 src/conf/domain_validate.c        |  32 ++++---
 src/conf/schemas/domaincommon.rng |   4 +-
 src/libvirt_private.syms          |   2 +
 src/qemu/qemu_alias.c             |  15 ++-
 src/qemu/qemu_command.c           | 146 ++++++++++++++--------------
 src/qemu/qemu_domain_address.c    |  35 +++----
 src/qemu/qemu_driver.c            |   8 +-
 src/qemu/qemu_postparse.c         |  11 ++-
 src/qemu/qemu_validate.c          |   2 +-
 11 files changed, 284 insertions(+), 133 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index dc222887d4..5ea4d6424b 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -16446,6 +16447,112 @@ virDomainInputDefFind(const virDomainDef *def,
 }

+bool
+virDomainIOMMUDefEquals(const virDomainIOMMUDef *a,
+                        const virDomainIOMMUDef *b)
+{
+    if (a->model != b->model ||
+        a->intremap != b->intremap ||
+        a->caching_mode != b->caching_mode ||
+        a->eim != b->eim ||
+        a->iotlb != b->iotlb ||
+        a->aw_bits != b->aw_bits ||
+        a->parent_idx != b->parent_idx ||
+        a->accel != b->accel ||
+        a->dma_translation != b->dma_translation)
+        return false;
+
+    switch (a->info.type) {
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI:
+        if (a->info.addr.pci.domain != b->info.addr.pci.domain ||
+            a->info.addr.pci.bus != b->info.addr.pci.bus ||
+            a->info.addr.pci.slot != b->info.addr.pci.slot ||
+            a->info.addr.pci.function != b->info.addr.pci.function) {
+            return false;
+        }
+        break;
+
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DRIVE:
+        if (a->info.addr.drive.controller != b->info.addr.drive.controller ||
+            a->info.addr.drive.bus != b->info.addr.drive.bus ||
+            a->info.addr.drive.unit != b->info.addr.drive.unit) {
+            return false;
+        }
+        break;
+
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_SERIAL:
+        if (a->info.addr.vioserial.controller != b->info.addr.vioserial.controller ||
+            a->info.addr.vioserial.bus != b->info.addr.vioserial.bus ||
+            a->info.addr.vioserial.port != b->info.addr.vioserial.port) {
+            return false;
+        }
+        break;
+
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCID:
+        if (a->info.addr.ccid.controller != b->info.addr.ccid.controller ||
+            a->info.addr.ccid.slot != b->info.addr.ccid.slot) {
+            return false;
+        }
+        break;
+
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA:
+        if (a->info.addr.isa.iobase != b->info.addr.isa.iobase ||
+            a->info.addr.isa.irq != b->info.addr.isa.irq) {
+            return false;
+        }
+        break;
+
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM:
+        if (a->info.addr.dimm.slot != b->info.addr.dimm.slot) {
+            return false;
+        }
+
+        if (a->info.addr.dimm.base != b->info.addr.dimm.base) {
+            return false;
+        }
+        break;
+
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCW:
+        if (a->info.addr.ccw.cssid != b->info.addr.ccw.cssid ||
+            a->info.addr.ccw.ssid != b->info.addr.ccw.ssid ||
+            a->info.addr.ccw.devno != b->info.addr.ccw.devno) {
+            return false;
+        }
+        break;
+
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_USB:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_SPAPRVIO:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED:
+    case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST:
+        break;
+    }
+
+    if (a->info.acpiIndex != b->info.acpiIndex) {
+        return false;
+    }

Most of this should go away if you use virDomainDeviceInfoAddressIsEqual

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 8/27/2025 6:51 AM, Daniel P. Berrangé wrote:
+bool
+virDomainIOMMUDefEquals(const virDomainIOMMUDef *a,
+                        const virDomainIOMMUDef *b)
[...]

Most of this should go away if you use virDomainDeviceInfoAddressIsEqual

Thanks for the suggestion. I will simplify this section with virDomainDeviceInfoAddressIsEqual.

Thanks,
Nathan

Implement iommufdId attribute for hostdev devices that can be used to specify associated iommufd object when launching a qemu VM. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 9 +++++++++ src/conf/domain_conf.c | 20 ++++++++++++++++++++ src/conf/domain_conf.h | 1 + src/conf/schemas/domaincommon.rng | 9 +++++++++ src/qemu/qemu_command.c | 14 ++++++++++++++ 5 files changed, 53 insertions(+) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 2558df18ef..e2b9be16c9 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4581,6 +4581,7 @@ or: </source> <boot order='1'/> <rom bar='on' file='/etc/fake/boot.bin'/> + <iommufdId>iommufd0</iommufdId> </hostdev> </devices> ... @@ -4829,6 +4830,14 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. +``iommufdId`` + The ``iommufdId`` element is used to specify using the iommufd interface to + propagate DMA mappings to the kernel, instead of legacy VFIO. When the + element is present, an iommufd object with its ID specified by ``iommufdId`` + will be created by the resulting qemu command. Libvirt will open the + /dev/iommu and VFIO device cdev, passing the associated file descriptor + numbers to the qemu command. 
+ ``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5ea4d6424b..38d8f2998a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -13581,6 +13581,15 @@ virDomainVideoDefParseXML(virDomainXMLOption *xmlopt, return g_steal_pointer(&def); } +static void +virDomainHostdevDefIommufdParseXML(xmlXPathContextPtr ctxt, + char** iommufdId) +{ + g_autofree char *iommufdIdtmp = virXPathString("string(./iommufdId)", ctxt); + if (iommufdIdtmp) + *iommufdId = g_steal_pointer(&iommufdIdtmp); +} + static virDomainHostdevDef * virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt, xmlNodePtr node, @@ -13655,6 +13664,8 @@ virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt, if (virDomainNetTeamingInfoParseXML(ctxt, &def->teaming) < 0) goto error; + virDomainHostdevDefIommufdParseXML(ctxt, &def->iommufdId); + return def; error: @@ -21195,6 +21206,11 @@ virDomainHostdevDefCheckABIStability(virDomainHostdevDef *src, } } + if (src->iommufdId && dst->iommufdId) { + if (STRNEQ(src->iommufdId, dst->iommufdId)) + return false; + } + if (!virDomainDeviceInfoCheckABIStability(src->info, dst->info)) return false; @@ -27554,6 +27570,10 @@ virDomainHostdevDefFormat(virBuffer *buf, if (def->shareable) virBufferAddLit(buf, "<shareable/>\n"); + if (def->iommufdId) { + virBufferAsprintf(buf, "<iommufdId>%s</iommufdId>\n", def->iommufdId); + } + virDomainDeviceInfoFormat(buf, def->info, flags | VIR_DOMAIN_DEF_FORMAT_ALLOW_BOOT | VIR_DOMAIN_DEF_FORMAT_ALLOW_ROM); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index edb18632f3..367e7686f1 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -375,6 +375,7 @@ struct _virDomainHostdevDef { virDomainHostdevCaps caps; } source; virDomainNetTeamingInfo *teaming; + char *iommufdId; virDomainDeviceInfo *info; /* Guest address */ }; diff --git 
a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index fd19f115f7..662f12c4f1 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6507,6 +6507,9 @@ <optional> <ref name="address"/> </optional> + <optional> + <ref name="iommufdId"/> + </optional> <optional> <element name="readonly"> <empty/> @@ -7761,6 +7764,12 @@ </element> </define> + <define name="iommufdId"> + <element name="iommufdId"> + <text/> + </element> + </define> + <define name="deviceBoot"> <element name="boot"> <attribute name="order"> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index cecd0661ca..6b3e2ffd0d 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4846,6 +4846,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, "S:failover_pair_id", failover_pair_id, "S:display", qemuOnOffAuto(pcisrc->display), "B:ramfb", ramfb, + "S:iommufd", dev->iommufdId, NULL) < 0) return NULL; @@ -5225,6 +5226,8 @@ qemuBuildHostdevCommandLine(virCommand *cmd, virQEMUCaps *qemuCaps) { size_t i; + g_autoptr(virJSONValue) props = NULL; + int iommufd = 0; for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5234,6 +5237,17 @@ qemuBuildHostdevCommandLine(virCommand *cmd, g_autofree char *vhostfdName = NULL; int vhostfd = -1; + if (hostdev->iommufdId && iommufd == 0) { + iommufd = 1; + if (qemuMonitorCreateObjectProps(&props, "iommufd", + hostdev->iommufdId, + NULL) < 0) + return -1; + + if (qemuBuildObjectCommandlineFromJSON(cmd, props) < 0) + return -1; + } + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS) continue; -- 2.43.0

On Thu, Aug 14, 2025 at 07:54:12PM -0700, Nathan Chen via Devel wrote:
Implement iommufdId attribute for hostdev devices that can be used to specify associated iommufd object when launching a qemu VM.
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
---
 docs/formatdomain.rst             |  9 +++++++++
 src/conf/domain_conf.c            | 20 ++++++++++++++++++++
 src/conf/domain_conf.h            |  1 +
 src/conf/schemas/domaincommon.rng |  9 +++++++++
 src/qemu/qemu_command.c           | 14 ++++++++++++++
 5 files changed, 53 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst
index 2558df18ef..e2b9be16c9 100644
--- a/docs/formatdomain.rst
+++ b/docs/formatdomain.rst
@@ -4581,6 +4581,7 @@ or:
        </source>
        <boot order='1'/>
        <rom bar='on' file='/etc/fake/boot.bin'/>
+       <iommufdId>iommufd0</iommufdId>

IIUC, the only place that is used is in the QEMU command line as an 'id' value.

I'm sure we've discussed this before, but could you remind me: are we expecting every <hostdev> to have a separate iommufd FD, or are we expecting the same FD for all, or both/either?

i.e. can we turn this into a simple yes/no flag? If not, then we can probably turn this into a simple index value to express the uniqueness / sharing characteristics, without exposing the QEMU ID string concept directly.

Either way, we can probably stuff this under <driver> rather than creating a new element, e.g.

  <driver .... iommufd=yes|no>

or

  <driver .... iommufdIndex="NNNN"/>

depending on the answer to the previous question.

      </hostdev>
    </devices>
    ...
@@ -4829,6 +4830,14 @@ or:
    device; if PCI ROM loading is disabled through this attribute, attempts to
    tweak the loading process further using the ``bar`` or ``file`` attributes
    will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`.
+``iommufdId``
+   The ``iommufdId`` element is used to specify using the iommufd interface to
+   propagate DMA mappings to the kernel, instead of legacy VFIO. When the
+   element is present, an iommufd object with its ID specified by ``iommufdId``
+   will be created by the resulting qemu command. Libvirt will open the
+   /dev/iommu and VFIO device cdev, passing the associated file descriptor
+   numbers to the qemu command.
+
 ``address``
    The ``address`` element for USB devices has a ``bus`` and ``device``
    attribute to specify the USB bus and device number the device appears at on

With regards,
Daniel

On 8/27/2025 7:04 AM, Daniel P. Berrangé wrote:
[...]
+       <iommufdId>iommufd0</iommufdId>

IIUC, the only place that is used is in the QEMU command line as an 'id' value.
I'm sure we've discussed this before, but could you remind me: are we expecting every <hostdev> to have a separate iommufd FD, or are we expecting the same FD for all, or both/either?
i.e. can we turn this into a simple yes/no flag? If not, then we can probably turn this into a simple index value to express the uniqueness / sharing characteristics, without exposing the QEMU ID string concept directly.
Either way, we can probably stuff this under <driver> rather than creating a new element eg
<driver .... iommufd=yes|no>
or
<driver .... iommufdIndex="NNNN"/>
depending on the answer to the previous question.

We would expect separate FDs for each VFIO cdev (/dev/vfio/devices/vfioX) and a shared FD among the devices for /dev/iommu (the <iommufdId> value here).

Agreed that we can turn this into a simple yes/no flag and put it under <driver>. I will note this for the next revision.

Thanks,
Nathan Chen
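
The FD ownership model described in this reply can be sketched as a small planning function: any hostdev that opts in needs its own cdev FD, while a single /dev/iommu FD is shared by all of them. This is only an illustration under assumptions; the Hostdev struct with a boolean iommufd flag mirrors the yes/no attribute proposed above, and FdPlan and plan_fds() are hypothetical names, not libvirt code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-hostdev view: only the proposed yes/no flag is modelled. */
typedef struct {
    bool iommufd;
} Hostdev;

/* How many file descriptors would need to be opened for the domain. */
typedef struct {
    size_t iommu_fds;   /* FDs for /dev/iommu: 0 or 1, shared by all hostdevs */
    size_t cdev_fds;    /* FDs for /dev/vfio/devices/vfioX: one per hostdev */
} FdPlan;

static FdPlan
plan_fds(const Hostdev *devs, size_t n)
{
    FdPlan plan = {0, 0};
    size_t i;

    assert(devs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!devs[i].iommufd)
            continue;
        plan.iommu_fds = 1;   /* the same FD serves every flagged hostdev */
        plan.cdev_fds++;      /* each flagged hostdev gets its own cdev FD */
    }
    return plan;
}
```

With two flagged hostdevs and one legacy VFIO hostdev, the plan is one /dev/iommu FD and two cdev FDs, which matches the two "fd" values and single iommufd object in the cover letter's example command line.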

Open iommufd FDs from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu and one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 44 +++++++- src/qemu/qemu_command.h | 3 +- src/qemu/qemu_domain.c | 8 ++ src/qemu/qemu_domain.h | 7 ++ src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 232 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 290 insertions(+), 6 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 6b3e2ffd0d..359dbb2621 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4797,7 +4797,8 @@ qemuBuildVideoCommandLine(virCommand *cmd, virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev) + virDomainHostdevDef *dev, + virDomainObj *vm) { g_autoptr(virJSONValue) props = NULL; virDomainHostdevSubsysPCI *pcisrc = &dev->source.subsys.u.pci; @@ -4807,6 +4808,13 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *driver = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + qemuDomainObjPrivate *priv = vm ? 
vm->privateData : NULL; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + dev->iommufdId) { + useIommufd = true; + } /* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4850,6 +4858,18 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL; + if (useIommufd && priv) { + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); + + int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + if (virJSONValueObjectAdd(&props, + "S:fd", g_strdup_printf("%d", vfiofd), + NULL) < 0) + return NULL; + } + if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0) return NULL; @@ -5223,11 +5243,13 @@ qemuBuildHostdevSCSICommandLine(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; + qemuDomainObjPrivate *priv = vm->privateData; for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5239,8 +5261,11 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (hostdev->iommufdId && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); + if (qemuMonitorCreateObjectProps(&props, "iommufd", hostdev->iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd), NULL) < 0) return -1; @@ -5270,7 +5295,18 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1; - if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev))) + if (hostdev->iommufdId) { + virDomainHostdevSubsysPCI *pcisrc = &hostdev->source.subsys.u.pci; + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + pcisrc->addr.domain, pcisrc->addr.bus, + 
pcisrc->addr.slot, pcisrc->addr.function); + + int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + + virCommandPassFD(cmd, vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); + } + + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev, vm))) return -1; if (qemuBuildDeviceCommandlineFromJSON(cmd, devprops, def, qemuCaps) < 0) @@ -10960,7 +10996,7 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildRedirdevCommandLine(cmd, def, qemuCaps) < 0) return NULL; - if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) + if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps, vm) < 0) return NULL; if (migrateURI) diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h index ad068f1f16..380aac261f 100644 --- a/src/qemu/qemu_command.h +++ b/src/qemu/qemu_command.h @@ -180,7 +180,8 @@ qemuBuildThreadContextProps(virJSONValue **tcProps, /* Current, best practice */ virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev); + virDomainHostdevDef *dev, + virDomainObj *vm); virJSONValue * qemuBuildRNGDevProps(const virDomainDef *def, diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index a2c7c88a7e..2086dbb575 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1954,6 +1954,11 @@ qemuDomainObjPrivateFree(void *data) virChrdevFree(priv->devs); + if (priv->iommufd >= 0) { + virEventRemoveHandle(priv->iommufd); + priv->iommufd = -1; + } + if (priv->pidMonitored >= 0) { virEventRemoveHandle(priv->pidMonitored); priv->pidMonitored = -1; @@ -1975,6 +1980,7 @@ qemuDomainObjPrivateFree(void *data) g_clear_pointer(&priv->blockjobs, g_hash_table_unref); g_clear_pointer(&priv->fds, g_hash_table_unref); + g_clear_pointer(&priv->vfioDeviceFds, g_hash_table_unref); /* This should never be non-NULL if we get here, but just in case... 
*/ if (priv->eventThread) { @@ -2003,7 +2009,9 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->blockjobs = virHashNew(virObjectUnref); priv->fds = virHashNew(g_object_unref); + priv->vfioDeviceFds = g_hash_table_new(g_str_hash, g_str_equal); + priv->iommufd = -1; priv->pidMonitored = -1; /* agent commands block by default, user can choose different behavior */ diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 1afd932764..6460323554 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -266,6 +266,10 @@ struct _qemuDomainObjPrivate { /* named file descriptor groups associated with the VM */ GHashTable *fds; + int iommufd; + + GHashTable *vfioDeviceFds; + char *memoryBackingDir; }; @@ -1172,3 +1176,6 @@ qemuDomainCheckCPU(virArch arch, bool qemuDomainMachineSupportsFloppy(const char *machine, virQEMUCaps *qemuCaps); + +int qemuProcessOpenVfioFds(virDomainObj *vm); +void qemuProcessCloseVfioFds(virDomainObj *vm); diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index e9568af125..e0e693e251 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -1633,7 +1633,7 @@ qemuDomainAttachHostPCIDevice(virQEMUDriver *driver, goto error; } - if (!(devprops = qemuBuildPCIHostdevDevProps(vm->def, hostdev))) + if (!(devprops = qemuBuildPCIHostdevDevProps(vm->def, hostdev, vm))) goto error; qemuDomainObjEnterMonitor(vm); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index a81c02c9d5..1bc779c6aa 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -25,6 +25,7 @@ #include <unistd.h> #include <signal.h> #include <sys/stat.h> +#include <dirent.h> #if WITH_SYS_SYSCALL_H # include <sys/syscall.h> #endif @@ -8025,6 +8026,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuExtDevicesStart(driver, vm, incomingMigrationExtDevices) < 0) goto cleanup; + if (qemuProcessOpenVfioFds(vm) < 0) + goto cleanup; + if (!(cmd = qemuBuildCommandLine(vm, incoming ? 
"defer" : NULL, vmop, @@ -10206,3 +10210,231 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, qemuProcessEventSubmit(vm, QEMU_PROCESS_EVENT_NBDKIT_EXITED, 0, 0, nbdkit); virObjectUnlock(vm); } + +/** + * qemuProcessOpenIommuFd: + * @vm: domain object + * @iommuFd: returned file descriptor + * + * Opens /dev/iommu file descriptor for the VM. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenIommuFd(virDomainObj *vm, int *iommuFd) +{ + int fd = -1; + + VIR_DEBUG("Opening IOMMU FD for domain %s", vm->def->name); + + if ((fd = open("/dev/iommu", O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU FD support requires /dev/iommu device")); + } else { + virReportSystemError(errno, "%s", + _("cannot open /dev/iommu")); + } + return -1; + } + + *iommuFd = fd; + VIR_DEBUG("Opened IOMMU FD %d for domain %s", fd, vm->def->name); + return 0; +} + +/** + * qemuProcessGetVfioDevicePath: + * @hostdev: host device definition + * @vfioPath: returned VFIO device path + * + * Constructs the VFIO device path for a PCI hostdev. 
+ *
+ * Returns: 0 on success, -1 on failure
+ */
+static int
+qemuProcessGetVfioDevicePath(virDomainHostdevDef *hostdev,
+                             char **vfioPath)
+{
+    virPCIDeviceAddress *addr;
+    g_autofree char *sysfsPath = NULL;
+    DIR *dir = NULL;
+    struct dirent *entry = NULL;
+    int ret = -1;
+
+    if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS ||
+        hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) {
+        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                       _("VFIO FD only supported for PCI hostdevs"));
+        return -1;
+    }
+
+    addr = &hostdev->source.subsys.u.pci.addr;
+
+    /* Build sysfs path: /sys/bus/pci/devices/DDDD:BB:DD.F/vfio-dev/ */
+    sysfsPath = g_strdup_printf("/sys/bus/pci/devices/"
+                                "%04x:%02x:%02x.%d/vfio-dev/",
+                                addr->domain, addr->bus,
+                                addr->slot, addr->function);
+
+    if (virDirOpen(&dir, sysfsPath) < 0) {
+        virReportSystemError(errno,
+                             _("cannot open VFIO sysfs directory %1$s"),
+                             sysfsPath);
+        return -1;
+    }
+
+    /* Find the vfio device name in the directory */
+    while (virDirRead(dir, &entry, sysfsPath) > 0) {
+        if (STRPREFIX(entry->d_name, "vfio")) {
+            *vfioPath = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name);
+            ret = 0;
+            break;
+        }
+    }
+
+    if (ret < 0) {
+        virReportError(VIR_ERR_INTERNAL_ERROR,
+                       _("cannot find VFIO device for PCI device %1$04x:%2$02x:%3$02x.%4$d"),
+                       addr->domain, addr->bus, addr->slot, addr->function);
+    }
+
+    virDirClose(dir);
+    return ret;
+}
+
+/**
+ * qemuProcessOpenVfioDeviceFd:
+ * @hostdev: host device definition
+ * @vfioFd: returned file descriptor
+ *
+ * Opens the VFIO device file descriptor for a hostdev.
+ *
+ * Returns: 0 on success, -1 on failure
+ */
+static int
+qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev,
+                            int *vfioFd)
+{
+    g_autofree char *vfioPath = NULL;
+    int fd = -1;
+
+    if (qemuProcessGetVfioDevicePath(hostdev, &vfioPath) < 0)
+        return -1;
+
+    VIR_DEBUG("Opening VFIO device %s", vfioPath);
+
+    if ((fd = open(vfioPath, O_RDWR | O_CLOEXEC)) < 0) {
+        if (errno == ENOENT) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           _("VFIO device %1$s not found - ensure device is bound to vfio-pci driver"),
+                           vfioPath);
+        } else {
+            virReportSystemError(errno,
+                                 _("cannot open VFIO device %1$s"), vfioPath);
+        }
+        return -1;
+    }
+
+    *vfioFd = fd;
+    VIR_DEBUG("Opened VFIO device FD %d for %s", *vfioFd, vfioPath);
+    return 0;
+}
+
+/**
+ * qemuProcessOpenVfioFds:
+ * @vm: domain object
+ *
+ * Opens all necessary VFIO file descriptors for the domain.
+ *
+ * Returns: 0 on success, -1 on failure
+ */
+int
+qemuProcessOpenVfioFds(virDomainObj *vm)
+{
+    qemuDomainObjPrivate *priv = vm->privateData;
+    bool needsIommuFd = false;
+    size_t i;
+
+    /* Check if we have any hostdevs that need VFIO FDs */
+    for (i = 0; i < vm->def->nhostdevs; i++) {
+        virDomainHostdevDef *hostdev = vm->def->hostdevs[i];
+        int vfioFd = -1;
+        g_autofree char *fdname = NULL;
+
+        if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS &&
+            hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) {
+
+            /* Check if this hostdev uses VFIO with IOMMU FD */
+            if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO &&
+                hostdev->iommufdId) {
+
+                needsIommuFd = true;
+
+                /* Open VFIO device FD */
+                if (qemuProcessOpenVfioDeviceFd(hostdev, &vfioFd) < 0)
+                    goto error;
+
+                /* Store the FD */
+                fdname = g_strdup_printf("vfio-%04x:%02x:%02x.%d",
+                                         hostdev->source.subsys.u.pci.addr.domain,
+                                         hostdev->source.subsys.u.pci.addr.bus,
+                                         hostdev->source.subsys.u.pci.addr.slot,
+                                         hostdev->source.subsys.u.pci.addr.function);
+
+                g_hash_table_insert(priv->vfioDeviceFds, g_steal_pointer(&fdname), GINT_TO_POINTER(vfioFd));
+
+                VIR_DEBUG("Stored VFIO FD for device %s", fdname);
+            }
+        }
+    }
+
+    /* Open IOMMU FD if needed */
+    if (needsIommuFd) {
+        int iommuFd = -1;
+
+        if (qemuProcessOpenIommuFd(vm, &iommuFd) < 0)
+            goto error;
+
+        priv->iommufd = iommuFd;
+
+        VIR_DEBUG("Stored IOMMU FD");
+    }
+
+    return 0;
+
+ error:
+    qemuProcessCloseVfioFds(vm);
+    return -1;
+}
+
+/**
+ * qemuProcessCloseVfioFds:
+ * @vm: domain object
+ *
+ * Closes all VFIO file descriptors for the domain.
+ */
+void
+qemuProcessCloseVfioFds(virDomainObj *vm)
+{
+    qemuDomainObjPrivate *priv = vm->privateData;
+    GHashTableIter iter;
+    gpointer key, value;
+
+    /* Close all VFIO device FDs */
+    if (priv->vfioDeviceFds) {
+        g_hash_table_iter_init(&iter, priv->vfioDeviceFds);
+        while (g_hash_table_iter_next(&iter, &key, &value)) {
+            int fd = GPOINTER_TO_INT(value);
+            VIR_DEBUG("Closing VFIO device FD %d for %s", fd, (char*)key);
+            VIR_FORCE_CLOSE(fd);
+        }
+        g_hash_table_remove_all(priv->vfioDeviceFds);
+    }
+
+    /* Close IOMMU FD */
+    if (priv->iommufd >= 0) {
+        VIR_DEBUG("Closing IOMMU FD %d", priv->iommufd);
+        VIR_FORCE_CLOSE(priv->iommufd);
+    }
+}
-- 
2.43.0
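
For readers following the lookup in qemuProcessGetVfioDevicePath(), the mapping from a PCI address to its VFIO character device can be sketched outside libvirt roughly as below. This is a minimal plain-C illustration, not the patch's code: the helper name vfio_cdev_path_for() and its sysfs_root and buffer parameters are hypothetical, and it uses opendir()/readdir() where the patch uses virDirOpen()/virDirRead(). The sysfs_root parameter exists only so the lookup can be pointed at a fake tree.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libvirt): find the "vfioN" entry under
 * <sysfs_root>/bus/pci/devices/<DDDD:BB:SS.F>/vfio-dev/ and write the
 * matching /dev/vfio/devices/vfioN path into @out.
 * sysfs_root would normally be "/sys".
 * Returns 0 on success, -1 if the directory or entry is missing or a
 * buffer is too small. */
static int
vfio_cdev_path_for(const char *sysfs_root,
                   unsigned int domain, unsigned int bus,
                   unsigned int slot, unsigned int function,
                   char *out, size_t outlen)
{
    char dirpath[512];
    struct dirent *entry;
    DIR *dir;
    int ret = -1;
    int n;

    n = snprintf(dirpath, sizeof(dirpath),
                 "%s/bus/pci/devices/%04x:%02x:%02x.%u/vfio-dev",
                 sysfs_root, domain, bus, slot, function);
    if (n < 0 || (size_t)n >= sizeof(dirpath))
        return -1;

    if (!(dir = opendir(dirpath)))
        return -1;

    while ((entry = readdir(dir))) {
        /* Skip ".", ".." and anything that is not a vfioN entry. */
        if (strncmp(entry->d_name, "vfio", 4) != 0)
            continue;

        n = snprintf(out, outlen, "/dev/vfio/devices/%s", entry->d_name);
        if (n >= 0 && (size_t)n < outlen)
            ret = 0;
        break;
    }

    /* Close the directory on every path after a successful opendir(). */
    closedir(dir);
    return ret;
}
```

With the first hostdev from the cover letter, a call such as vfio_cdev_path_for("/sys", 0x0009, 0x01, 0x00, 0, buf, sizeof(buf)) would look under /sys/bus/pci/devices/0009:01:00.0/vfio-dev/.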

Allow access to /dev/iommu and /dev/vfio/devices/vfio* when launching a qemu
VM with iommufd feature enabled.

Signed-off-by: Nathan Chen <nathanc@nvidia.com>
---
 src/qemu/qemu_cgroup.c           | 61 ++++++++++++++++++++++++++++
 src/qemu/qemu_cgroup.h           |  1 +
 src/qemu/qemu_namespace.c        | 44 +++++++++++++++++++++
 src/security/security_apparmor.c | 11 ++++++
 src/security/security_dac.c      | 23 +++++++++++
 src/security/security_selinux.c  | 24 +++++++++++
 src/util/virpci.c                | 68 ++++++++++++++++++++++++++++++++
 src/util/virpci.h                |  1 +
 8 files changed, 233 insertions(+)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index f10976c2b0..73d0cb3a7a 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -462,6 +462,54 @@ qemuTeardownInputCgroup(virDomainObj *vm,
 }
 
 
+int
+qemuSetupIommufdCgroup(virDomainObj *vm)
+{
+    qemuDomainObjPrivate *priv = vm->privateData;
+    g_autoptr(DIR) dir = NULL;
+    struct dirent *dent;
+    g_autofree char *path = NULL;
+    int iommufd = 0;
+    size_t i;
+
+    for (i = 0; i < vm->def->nhostdevs; i++) {
+        if (vm->def->hostdevs[i]->iommufdId) {
+            iommufd = 1;
+            break;
+        }
+    }
+
+    if (iommufd == 1) {
+        if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES))
+            return 0;
+        if (virDirOpen(&dir, "/dev/vfio/devices") < 0) {
+            if (errno == ENOENT)
+                return 0;
+            return -1;
+        }
+        while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) {
+            if (STRPREFIX(dent->d_name, "vfio")) {
+                path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name);
+            }
+            if (path &&
+                qemuCgroupAllowDevicePath(vm, path,
+                                          VIR_CGROUP_DEVICE_RW, false) < 0) {
+                return -1;
+            }
+            path = NULL;
+        }
+        if (virFileExists("/dev/iommu"))
+            path = g_strdup("/dev/iommu");
+        if (path &&
+            qemuCgroupAllowDevicePath(vm, path,
+                                      VIR_CGROUP_DEVICE_RW, false) < 0) {
+            return -1;
+        }
+    }
+    return 0;
+}
+
+
 /**
  * qemuSetupHostdevCgroup:
  * vm: domain object
@@ -760,6 +808,7 @@ qemuSetupDevicesCgroup(virDomainObj *vm)
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver);
     const char *const *deviceACL = (const char *const *) cfg->cgroupDeviceACL;
     int rv = -1;
+    int iommufd = 0;
     size_t i;
 
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES))
@@ -830,6 +879,18 @@ qemuSetupDevicesCgroup(virDomainObj *vm)
             return -1;
     }
 
+    for (i = 0; i < vm->def->nhostdevs; i++) {
+        if (vm->def->hostdevs[i]->iommufdId) {
+            iommufd = 1;
+            break;
+        }
+    }
+
+    if (iommufd == 1) {
+        if (qemuSetupIommufdCgroup(vm) < 0)
+            return -1;
+    }
+
     for (i = 0; i < vm->def->nmems; i++) {
         if (qemuSetupMemoryDevicesCgroup(vm, vm->def->mems[i]) < 0)
             return -1;
diff --git a/src/qemu/qemu_cgroup.h b/src/qemu/qemu_cgroup.h
index 3668034cde..bea677ba3c 100644
--- a/src/qemu/qemu_cgroup.h
+++ b/src/qemu/qemu_cgroup.h
@@ -42,6 +42,7 @@ int qemuSetupHostdevCgroup(virDomainObj *vm,
 int qemuTeardownHostdevCgroup(virDomainObj *vm,
                               virDomainHostdevDef *dev)
    G_GNUC_WARN_UNUSED_RESULT;
+int qemuSetupIommufdCgroup(virDomainObj *vm);
 int qemuSetupMemoryDevicesCgroup(virDomainObj *vm,
                                  virDomainMemoryDef *mem);
 int qemuTeardownMemoryDevicesCgroup(virDomainObj *vm,
diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c
index f72da83929..965a304f7f 100644
--- a/src/qemu/qemu_namespace.c
+++ b/src/qemu/qemu_namespace.c
@@ -677,6 +677,47 @@ qemuDomainSetupLaunchSecurity(virDomainObj *vm,
 }
 
 
+static int
+qemuDomainSetupIommufd(virDomainObj *vm,
+                       GSList **paths)
+{
+    g_autoptr(DIR) dir = NULL;
+    struct dirent *dent;
+    g_autofree char *path = NULL;
+    int iommufd = 0;
+    size_t i;
+
+    for (i = 0; i < vm->def->nhostdevs; i++) {
+        if (vm->def->hostdevs[i]->iommufdId) {
+            iommufd = 1;
+            break;
+        }
+    }
+
+    /* Check if iommufd is enabled */
+    if (iommufd == 1) {
+        if (virDirOpen(&dir, "/dev/vfio/devices") < 0) {
+            if (errno == ENOENT)
+                return 0;
+            return -1;
+        }
+        while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) {
+            if (STRPREFIX(dent->d_name, "vfio")) {
+                path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name);
+                *paths = g_slist_prepend(*paths, g_steal_pointer(&path));
+            }
+        }
+        path = NULL;
+        if (virFileExists("/dev/iommu"))
+            path = g_strdup("/dev/iommu");
+        if (path)
+            *paths = g_slist_prepend(*paths, g_steal_pointer(&path));
+    }
+
+    return 0;
+}
+
+
 static int
 qemuNamespaceMknodPaths(virDomainObj *vm,
                         GSList *paths,
@@ -700,6 +741,9 @@ qemuDomainBuildNamespace(virQEMUDriverConfig *cfg,
     if (qemuDomainSetupAllDisks(vm, &paths) < 0)
         return -1;
 
+    if (qemuDomainSetupIommufd(vm, &paths) < 0)
+        return -1;
+
     if (qemuDomainSetupAllHostdevs(vm, &paths) < 0)
         return -1;
 
diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c
index 68ac39611f..73dc750c94 100644
--- a/src/security/security_apparmor.c
+++ b/src/security/security_apparmor.c
@@ -856,6 +856,17 @@ AppArmorSetSecurityHostdevLabel(virSecurityManager *mgr,
             }
             ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr);
             VIR_FREE(vfioGroupDev);
+
+            if (dev->iommufdId) {
+                g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci);
+                if (vfiofdDev) {
+                    int ret2 = AppArmorSetSecurityPCILabel(pci, vfiofdDev, ptr);
+                    if (ret2 < 0)
+                        ret = ret2;
+                } else {
+                    return -1;
+                }
+            }
         } else {
             ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr);
         }
diff --git a/src/security/security_dac.c b/src/security/security_dac.c
index 2f788b872a..327e36466d 100644
--- a/src/security/security_dac.c
+++ b/src/security/security_dac.c
@@ -1290,6 +1290,18 @@ virSecurityDACSetHostdevLabel(virSecurityManager *mgr,
             ret = virSecurityDACSetHostdevLabelHelper(vfioGroupDev,
                                                       false,
                                                       &cbdata);
+            if (dev->iommufdId) {
+                g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci);
+                if (vfiofdDev) {
+                    int ret2 = virSecurityDACSetHostdevLabelHelper(vfiofdDev,
+                                                                   false,
+                                                                   &cbdata);
+                    if (ret2 < 0)
+                        ret = ret2;
+                } else {
+                    return -1;
+                }
+            }
         } else {
             ret = virPCIDeviceFileIterate(pci,
                                           virSecurityDACSetPCILabel,
@@ -1450,6 +1462,17 @@ virSecurityDACRestoreHostdevLabel(virSecurityManager *mgr,
             ret = virSecurityDACRestoreFileLabelInternal(mgr, NULL,
                                                          vfioGroupDev, false);
+            if (dev->iommufdId) {
+                g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci);
+                if (vfiofdDev) {
+                    int ret2 = virSecurityDACRestoreFileLabelInternal(mgr, NULL,
+                                                                      vfiofdDev, false);
+                    if (ret2 < 0)
+                        ret = ret2;
+                } else {
+                    return -1;
+                }
+            }
         } else {
             ret = virPCIDeviceFileIterate(pci, virSecurityDACRestorePCILabel, mgr);
         }
diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c
index fa5d1568eb..60dcadd839 100644
--- a/src/security/security_selinux.c
+++ b/src/security/security_selinux.c
@@ -2248,6 +2248,19 @@ virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManager *mgr,
             ret = virSecuritySELinuxSetHostdevLabelHelper(vfioGroupDev,
                                                           false,
                                                           &data);
+            if (dev->iommufdId) {
+                g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci);
+                if (vfiofdDev) {
+                    int ret2 = virSecuritySELinuxSetHostdevLabelHelper(vfiofdDev,
+                                                                       false,
+                                                                       &data);
+                    if (ret2 < 0)
+                        ret = ret2;
+                } else {
+                    return -1;
+                }
+            }
+
         } else {
             ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxSetPCILabel, &data);
         }
@@ -2481,6 +2494,17 @@ virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManager *mgr,
                 return -1;
 
             ret = virSecuritySELinuxRestoreFileLabel(mgr, vfioGroupDev, false);
+
+            if (dev->iommufdId) {
+                g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci);
+                if (vfiofdDev) {
+                    int ret2 = virSecuritySELinuxRestoreFileLabel(mgr, vfiofdDev, false);
+                    if (ret2 < 0)
+                        ret = ret2;
+                } else {
+                    return -1;
+                }
+            }
         } else {
             ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxRestorePCILabel, mgr);
         }
diff --git a/src/util/virpci.c b/src/util/virpci.c
index 90617e69c6..6e6e5e47c0 100644
--- a/src/util/virpci.c
+++ b/src/util/virpci.c
@@ -2478,6 +2478,74 @@ virPCIDeviceGetIOMMUGroupDev(virPCIDevice *dev)
     return g_strdup_printf("/dev/vfio/%s", groupFile);
 }
 
+/* virPCIDeviceGetIOMMUFDDev - return the name of the device used
+ * to control this PCI device's group (e.g. "/dev/vfio/devices/vfio15")
+ */
+char *
+virPCIDeviceGetIOMMUFDDev(virPCIDevice *dev)
+{
+    g_autofree char *path = NULL;
+    const char *pci_addr = NULL;
+    g_autoptr(DIR) dir = NULL;
+    struct dirent *entry;
+    char *vfiodev = NULL;
+
+    /* Get PCI device address */
+    pci_addr = virPCIDeviceGetName(dev);
+    if (!pci_addr)
+        return NULL;
+
+    /* First try: look in PCI device's vfio-dev subdirectory */
+    path = g_strdup_printf("/sys/bus/pci/devices/%s/vfio-dev", pci_addr);
+
+    if (virDirOpen(&dir, path) == 1) {
+        while (virDirRead(dir, &entry, path) > 0) {
+            if (!g_str_has_prefix(entry->d_name, "vfio"))
+                continue;
+
+            vfiodev = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name);
+            break;
+        }
+        /* g_autoptr will automatically close dir when it goes out of scope */
+        dir = NULL;
+    }
+
+    /* Second try: scan /sys/class/vfio-dev for matching device */
+    if (!vfiodev) {
+        g_free(path);
+        path = g_strdup("/sys/class/vfio-dev");
+
+        if (virDirOpen(&dir, path) == 1) {
+            while (virDirRead(dir, &entry, path) > 0) {
+                g_autofree char *dev_link = NULL;
+                g_autofree char *target = NULL;
+
+                if (!g_str_has_prefix(entry->d_name, "vfio"))
+                    continue;
+
+                dev_link = g_strdup_printf("/sys/class/vfio-dev/%s/device", entry->d_name);
+
+                if (virFileResolveLink(dev_link, &target) < 0)
+                    continue;
+
+                if (strstr(target, pci_addr)) {
+                    vfiodev = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name);
+                    break;
+                }
+            }
+            /* g_autoptr will automatically close dir */
+        }
+    }
+
+    /* Verify the device path exists and is accessible */
+    if (vfiodev && !virFileExists(vfiodev)) {
+        VIR_FREE(vfiodev);
+        return NULL;
+    }
+
+    return vfiodev;
+}
+
 static int
 virPCIDeviceDownstreamLacksACS(virPCIDevice *dev)
 {
diff --git a/src/util/virpci.h b/src/util/virpci.h
index fc538566e1..996ffab2f9 100644
--- a/src/util/virpci.h
+++ b/src/util/virpci.h
@@ -203,6 +203,7 @@ int virPCIDeviceAddressGetIOMMUGroupNum(virPCIDeviceAddress *addr);
 char *virPCIDeviceAddressGetIOMMUGroupDev(const virPCIDeviceAddress *devAddr);
 bool virPCIDeviceExists(const virPCIDeviceAddress *addr);
 char *virPCIDeviceGetIOMMUGroupDev(virPCIDevice *dev);
+char *virPCIDeviceGetIOMMUFDDev(virPCIDevice *dev);
 
 int virPCIDeviceIsAssignable(virPCIDevice *dev,
                              int strict_acs_check);
-- 
2.43.0
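
The second lookup in virPCIDeviceGetIOMMUFDDev() resolves /sys/class/vfio-dev/vfioN/device and tests the result with strstr(). As an illustration only, the sketch below shows an exact comparison on the last path component instead. It assumes the usual sysfs layout, in which a resolved device path also names the upstream bridges, so a substring search could match a bridge address as well as the device itself. The helper name sysfs_target_is_device() is hypothetical and is not part of the patch or of libvirt.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: return true only when the final component of a
 * resolved sysfs device path is exactly @pci_addr ("DDDD:BB:SS.F").
 * A trailing '/' on @target is not handled in this sketch. */
static bool
sysfs_target_is_device(const char *target, const char *pci_addr)
{
    const char *last = strrchr(target, '/');

    /* No '/' at all: compare the whole string. */
    last = last ? last + 1 : target;
    return strcmp(last, pci_addr) == 0;
}
```

For a target such as /sys/devices/pci0009:00/0009:00:00.0/0009:01:00.0, this accepts 0009:01:00.0 and rejects the bridge address 0009:00:00.0, whereas a substring search finds both.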

On Thu, Aug 14, 2025 at 07:54:09PM -0700, Nathan Chen via Devel wrote:
Hi,
This is a follow up to the second RFC patchset [0] for supporting multiple vSMMU instances and using iommufd to propagate DMA mappings to kernel for VM-assigned host devices in a qemu VM.
This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]
Moreover, it adds a new 'iommufdId' attribute for hostdev devices to be associated with the iommufd object.
For instance, specifying the iommufd object and associated hostdev in a VM definition with multiple IOMMUs, configured to be routed to pcie-expander-bus controllers in a way where VFIO device to SMMUv3 associations are matched with the host:
<devices>
  ...
  <controller type='pci' index='1' model='pcie-expander-bus'>
    <model name='pxb-pcie'/>
    <target busNr='252'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
  </controller>
  <controller type='pci' index='2' model='pcie-expander-bus'>
    <model name='pxb-pcie'/>
    <target busNr='248'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
  </controller>
  ...
  <controller type='pci' index='21' model='pcie-root-port'>
    <model name='pcie-root-port'/>
    <target chassis='21' port='0x0'/>
    <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </controller>
  <controller type='pci' index='22' model='pcie-root-port'>
    <model name='pcie-root-port'/>
    <target chassis='22' port='0xa8'/>
    <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
  </controller>
  ...
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
    </source>
    <iommufdId>iommufd0</iommufdId>
    <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/>
  </hostdev>
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/>
    </source>
    <iommufdId>iommufd0</iommufdId>
    <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/>
  </hostdev>
  <iommu model='smmuv3Dev' parentIdx='1' accel='on'/>
  <iommu model='smmuv3Dev' parentIdx='2' accel='on'/>
</devices>
This would get translated to a qemu command line with the arguments below. Note that libvirt will open the /dev/iommu and VFIO cdev, passing the associated fd number to qemu:
-device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \
-device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \
-device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \
-device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \
-object '{"qom-type":"iommufd","id":"iommufd0","fd":"24"}' \
-device '{"driver":"arm-smmuv3-accel","primary-bus":"pci.1","id":"smmuv3.0","accel":true}' \
-device '{"driver":"arm-smmuv3-accel","primary-bus":"pci.2","id":"smmuv3.1","accel":true}' \
-device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","fd":"22","bus":"pci.21","addr":"0x0"}' \
-device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","fd":"25","bus":"pci.22","addr":"0x0"}' \
Summary of changes:
- Separated out commits for smmuv3Dev iommu model support and supporting multiple IOMMU definitions
- Made iommufd only a hostdev attribute
- Revised smmuv3Dev iommu model definition to reference the controller index instead of assigning it a BDF
- Open iommufd FDs from libvirt backend without exposing FDs to XML users
- Fixed iommufd path permissions
- Matched qemu usage of Shameer's latest RFCv3
This series is on Github: https://github.com/NathanChenNVIDIA/libvirt/tree/smmuv3Dev-iommufd-08-12-25
Thanks, Nathan
[0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/EASBQHPCLPK5G3PF3DEU57G6CI4GSC74/
[1] https://lore.kernel.org/qemu-devel/20250714155941.22176-1-shameerali.kolothum.thodi@huawei.com/
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Nathan Chen (5):
  qemu: add IOMMU model smmuv3Dev
  conf: Support multiple smmuv3Dev IOMMU devices
  qemu: Implement support for associating iommufd to hostdev
  qemu: open iommufd FDs from libvirt backend
  qemu: Update Cgroup, namespace, and seclabel for qemu to access iommufd paths
 docs/formatdomain.rst             |  22 ++-
 src/conf/domain_conf.c            | 208 ++++++++++++++++++++++--
 src/conf/domain_conf.h            |  13 +-
 src/conf/domain_validate.c        |  58 +++++--
 src/conf/schemas/domaincommon.rng |  24 ++-
 src/libvirt_private.syms          |   2 +
 src/qemu/qemu_alias.c             |  15 +-
 src/qemu/qemu_cgroup.c            |  61 +++++++
 src/qemu/qemu_cgroup.h            |   1 +
 src/qemu/qemu_command.c           | 261 ++++++++++++++++++++++--------
 src/qemu/qemu_command.h           |   3 +-
 src/qemu/qemu_domain.c            |   8 +
 src/qemu/qemu_domain.h            |   7 +
 src/qemu/qemu_domain_address.c    |  33 ++--
 src/qemu/qemu_driver.c            |   8 +-
 src/qemu/qemu_hotplug.c           |   2 +-
 src/qemu/qemu_namespace.c         |  44 +++++
 src/qemu/qemu_postparse.c         |  11 +-
 src/qemu/qemu_process.c           | 232 ++++++++++++++++++++++++++
 src/qemu/qemu_validate.c          |  18 ++-
 src/security/security_apparmor.c  |  11 ++
 src/security/security_dac.c       |  23 +++
 src/security/security_selinux.c   |  24 +++
 src/util/virpci.c                 |  68 ++++++++
 src/util/virpci.h                 |   1 +
 25 files changed, 1020 insertions(+), 138 deletions(-)
We could do with some changes to the test suite to provide sample XML and CLI args for the iommufd XML schema.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 8/27/2025 7:01 AM, Daniel P. Berrangé wrote:
We could do with some changes to the test suite to provide sample XML and CLI args for the iommufd XML schema.
Yes, I will include some sample XML and CLI args in the next revision. We will have to mock the fd numbers generated for the CLI command.

Thanks,
Nathan
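
As a generic illustration of why fixed fd numbers matter for such test data, the sketch below formats the iommufd -object argument from the cover letter using an fd obtained through a replaceable opener. Everything here is hypothetical (the iommu_opener type, stub_open_iommu() and format_iommufd_object() are not libvirt APIs), and libvirt's test suite has its own mocking mechanisms, which this does not attempt to reproduce. In production the opener would open /dev/iommu; in a test it returns a constant so the expected command line is stable.

```c
#include <stdio.h>

/* Hypothetical opener type: returns an fd number, or -1 on failure. */
typedef int (*iommu_opener)(void);

/* Test stub: always reports fd 24, the value used in the cover letter's
 * example command line. */
static int
stub_open_iommu(void)
{
    return 24;
}

/* Hypothetical formatter for the iommufd -object JSON argument.
 * Returns 0 on success, -1 if the opener fails or @out is too small. */
static int
format_iommufd_object(iommu_opener opener, const char *id,
                      char *out, size_t outlen)
{
    int fd = opener();
    int n;

    if (fd < 0)
        return -1;

    n = snprintf(out, outlen,
                 "{\"qom-type\":\"iommufd\",\"id\":\"%s\",\"fd\":\"%d\"}",
                 id, fd);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

Calling format_iommufd_object(stub_open_iommu, "iommufd0", buf, sizeof(buf)) yields the same JSON as the -object line in the example command line, independent of which descriptors happen to be free in the test process.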