[RFC PATCH 0/5] qemu: Implement support for iommufd and multiple vSMMUs

Hi, This is a follow up to the first RFC patchset [0] for supporting multiple vSMMU instances in a qemu VM. This patchset also introduces support for using iommufd to propagate DMA mappings to kernel for assigned devices. This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1] Moreover, it adds a new 'iommufd' member for virDomainIOMMUDef, in order to represent the iommufd object in qemu command line. This patchset also implements new 'iommufdId' and 'iommufdFd' attributes for hostdev devices to be associated with the iommufd object. For instance, specifying the iommufd object and associated hostdev in a VM definition with multiple IOMMUs, configured to be routed to pcie-expander-bus controllers in a way where VFIO device to SMMUv3 associations are matched with the host (pcie-expander-bus and pcie-root-port controllers are no longer auto-added/auto-routed like in the first revision of this RFC, as the PCIe topology will be configured by management apps): <devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='252'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='248'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> ... <controller type='pci' index='21' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='21' port='0x0'/> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </controller> <controller type='pci' index='22' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='22' port='0xa8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </controller> ... <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> </iommu> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </iommu> </devices> This would get translated to a qemu command line with the arguments below: -device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \ -device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \ -device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \ -device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \ -object '{"qom-type":"iommufd","id":"iommufd0"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.1"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.2"}' \ -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","bus":"pci.21","addr":"0x0"}' \ -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","bus":"pci.22","addr":"0x0"}' \ If users would like to leverage qemu's iommufd feature to open the VFIO cdev and /dev/iommu via an external management layer, the fd can be specified like so in the VM definition: <devices> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> </source> <iommufdId>iommufd0</iommufdId> <iommufdFd>23</iommufdFd> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </hostdev> <iommu model='intel'> <iommufd> <id>iommufd0</id> <fd>22</fd> </iommufd> </iommu> </devices> This would get translated to a qemu command line with the arguments below: -object '{"qom-type":"iommufd","id":"iommufd0","fd":"22"}' \ -device '{"driver":"vfio-pci","host":"0000:06:12.2","id":"hostdev1","iommufd":"iommufd0","fd":"23","bus":"pci.0","addr":"0x3"}' \ Summary of changes: - Introduced support for specifying multiple <iommu> stanzas in the VM XML definition when using smmuv3Dev model. - Automating PCIe topology to populate VM definition with multiple vSMMUs routed to pcie-expander-bus controllers is excluded, in favor of deferring creation of PXBs and routing of VFIO devices to management apps. - Introduced iommufd support. TODO: - I updated the namespace and cgroup configuration to allow access to iommufd paths at /dev/vfio/devices/vfio* and /dev/iommu. However, qemu needs to be launched with user and group set to 'root' in order for these paths to be accessible. A passthrough device represented by /dev/vfio/18 normally has 'root' user and group permissions, but in the mount namespace it's changed to 'libvirt-qemu' and 'kvm'. I wasn't able to discern where this is happening by looking at src/qemu/qemu_namespace.c and src/qemu/qemu_cgroup.c. Would you have any pointers on how to change the iommufd paths' user and group permissions in the libvirt mount namespace? This series is on Github: https://github.com/NathanChenNVIDIA/libvirt/tree/smmuv3Dev-iommufd-04-15-25 Thanks, Nathan [0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/7GDT6... [1] https://lore.kernel.org/qemu-devel/20250311141045.66620-1-shameerali.kolothu... Signed-off-by: Nathan Chen <nathanc@nvidia.com> Nathan Chen (5): conf: Support multiple smmuv3Dev IOMMU devices conf: Add an iommufd member struct to virDomainIOMMUDef qemu: Implement support for associating iommufd to hostdev qemu: Update Cgroup and namespace for qemu to access iommufd paths qemu: Add test case for specifying iommufd docs/formatdomain.rst | 5 +- src/conf/domain_addr.c | 12 +- src/conf/domain_addr.h | 4 +- src/conf/domain_conf.c | 292 ++++++++++++++++-- src/conf/domain_conf.h | 21 +- src/conf/domain_validate.c | 94 +++++- src/conf/schemas/domaincommon.rng | 37 ++- src/conf/virconftypes.h | 2 + src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 15 +- src/qemu/qemu_cgroup.c | 47 +++ src/qemu/qemu_cgroup.h | 1 + src/qemu/qemu_command.c | 146 ++++++--- src/qemu/qemu_domain_address.c | 33 +- src/qemu/qemu_driver.c | 8 +- src/qemu/qemu_namespace.c | 36 +++ src/qemu/qemu_postparse.c | 11 +- src/qemu/qemu_validate.c | 22 +- ...fio-iommufd-intel-iommu.x86_64-latest.args | 43 +++ ...vfio-iommufd-intel-iommu.x86_64-latest.xml | 80 +++++ .../hostdev-vfio-iommufd-intel-iommu.xml | 80 +++++ tests/qemuxmlconftest.c | 1 + 22 files changed, 878 insertions(+), 114 deletions(-) create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.xml -- 2.43.0

Add support for "smmuv3Dev" IOMMU model, and add support for parsing multiple IOMMU devices from the VM definition when "smmuv3Dev" is the IOMMU model. Enable plugging smmuv3Dev into pcie-root and pcie-expander-bus. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 5 +- src/conf/domain_addr.c | 12 ++- src/conf/domain_addr.h | 4 +- src/conf/domain_conf.c | 153 ++++++++++++++++++++++++++---- src/conf/domain_conf.h | 11 ++- src/conf/domain_validate.c | 38 +++++--- src/conf/schemas/domaincommon.rng | 8 +- src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 15 ++- src/qemu/qemu_command.c | 135 +++++++++++++++++--------- src/qemu/qemu_domain_address.c | 33 ++++--- src/qemu/qemu_driver.c | 8 +- src/qemu/qemu_postparse.c | 11 ++- src/qemu/qemu_validate.c | 22 ++++- 14 files changed, 342 insertions(+), 115 deletions(-) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index c7c75ae219..f9b835da7d 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -9043,8 +9043,9 @@ Example: ... ``model`` - Supported values are ``intel`` (for Q35 guests) ``smmuv3`` - (:since:`since 5.5.0`, for ARM virt guests), and ``virtio`` + Supported values are ``intel`` (for Q35 guests), ``smmuv3`` + (:since:`since 5.5.0`, for ARM virt guests), ``smmuv3Dev`` + (for ARM virt guests), and ``virtio`` (:since:`since 8.3.0`, for Q35 and ARM virt guests). ``driver`` diff --git a/src/conf/domain_addr.c b/src/conf/domain_addr.c index 8dfa8feca0..6a75a7aee1 100644 --- a/src/conf/domain_addr.c +++ b/src/conf/domain_addr.c @@ -386,6 +386,8 @@ virDomainPCIAddressFlagsCompatible(virPCIDeviceAddress *addr, connectStr = "pcie-expander-bus"; } else if (devFlags & VIR_PCI_CONNECT_TYPE_PCI_BRIDGE) { connectStr = "pci-bridge"; + } else if (devFlags & VIR_PCI_CONNECT_TYPE_SMMUV3_DEV) { + connectStr = "arm-smmuv3-dev"; } else { /* this should never happen. If it does, there is a * bug in the code that sets the flag bits for devices. @@ -517,7 +519,8 @@ virDomainPCIAddressBusSetModel(virDomainPCIAddressBus *bus, bus->flags = (VIR_PCI_CONNECT_TYPE_PCIE_DEVICE | VIR_PCI_CONNECT_TYPE_PCIE_ROOT_PORT | VIR_PCI_CONNECT_TYPE_DMI_TO_PCI_BRIDGE | - VIR_PCI_CONNECT_TYPE_PCIE_EXPANDER_BUS); + VIR_PCI_CONNECT_TYPE_PCIE_EXPANDER_BUS | + VIR_PCI_CONNECT_TYPE_SMMUV3_DEV); bus->minSlot = 1; bus->maxSlot = VIR_PCI_ADDRESS_SLOT_LAST; break; @@ -561,11 +564,12 @@ virDomainPCIAddressBusSetModel(virDomainPCIAddressBus *bus, bus->maxSlot = VIR_PCI_ADDRESS_SLOT_LAST; break; case VIR_DOMAIN_CONTROLLER_MODEL_PCIE_EXPANDER_BUS: - /* 32 slots, no hotplug, only accepts pcie-root-port or - * dmi-to-pci-bridge + /* 32 slots, no hotplug, only accepts pcie-root-port, + * dmi-to-pci-bridge, or arm-smmuv3-dev */ bus->flags = (VIR_PCI_CONNECT_TYPE_PCIE_ROOT_PORT | - VIR_PCI_CONNECT_TYPE_DMI_TO_PCI_BRIDGE); + VIR_PCI_CONNECT_TYPE_DMI_TO_PCI_BRIDGE | + VIR_PCI_CONNECT_TYPE_SMMUV3_DEV); bus->minSlot = 0; bus->maxSlot = VIR_PCI_ADDRESS_SLOT_LAST; break; diff --git a/src/conf/domain_addr.h b/src/conf/domain_addr.h index e72fb48847..906ba9a52d 100644 --- a/src/conf/domain_addr.h +++ b/src/conf/domain_addr.h @@ -57,6 +57,7 @@ typedef enum { VIR_PCI_CONNECT_TYPE_PCIE_EXPANDER_BUS = 1 << 11, VIR_PCI_CONNECT_TYPE_PCI_BRIDGE = 1 << 12, VIR_PCI_CONNECT_TYPE_PCIE_TO_PCI_BRIDGE = 1 << 13, + VIR_PCI_CONNECT_TYPE_SMMUV3_DEV = 1 << 14, } virDomainPCIConnectFlags; /* a combination of all bits that describe the type of connections @@ -71,7 +72,8 @@ typedef enum { VIR_PCI_CONNECT_TYPE_PCI_EXPANDER_BUS | \ VIR_PCI_CONNECT_TYPE_PCIE_EXPANDER_BUS | \ VIR_PCI_CONNECT_TYPE_PCI_BRIDGE | \ - VIR_PCI_CONNECT_TYPE_PCIE_TO_PCI_BRIDGE) + VIR_PCI_CONNECT_TYPE_PCIE_TO_PCI_BRIDGE | \ + VIR_PCI_CONNECT_TYPE_SMMUV3_DEV) /* combination of all bits that could be used to connect a normal * endpoint device (i.e. excluding the connection possible between an diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index b3b0bd7329..c8d481eb5c 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1350,6 +1350,7 @@ VIR_ENUM_IMPL(virDomainIOMMUModel, "intel", "smmuv3", "virtio", + "smmuv3Dev", ); VIR_ENUM_IMPL(virDomainVsockModel, @@ -4113,7 +4114,8 @@ void virDomainDefFree(virDomainDef *def) virDomainCryptoDefFree(def->cryptos[i]); g_free(def->cryptos); - virDomainIOMMUDefFree(def->iommu); + for (i = 0; i < def->niommus; i++) + virDomainIOMMUDefFree(def->iommu[i]); virDomainPstoreDefFree(def->pstore); @@ -4985,9 +4987,9 @@ virDomainDeviceInfoIterateFlags(virDomainDef *def, } device.type = VIR_DOMAIN_DEVICE_IOMMU; - if (def->iommu) { - device.data.iommu = def->iommu; - if ((rc = cb(def, &device, &def->iommu->info, opaque)) != 0) + for (i = 0; i < def->niommus; i++) { + device.data.iommu = def->iommu[i]; + if ((rc = cb(def, &device, &def->iommu[i]->info, opaque)) != 0) return rc; } @@ -16369,6 +16371,110 @@ virDomainInputDefFind(const virDomainDef *def, } +bool +virDomainIOMMUDefEquals(const virDomainIOMMUDef *a, + const virDomainIOMMUDef *b) +{ + if (a->model != b->model || + a->intremap != b->intremap || + a->caching_mode != b->caching_mode || + a->eim != b->eim || + a->iotlb != b->iotlb || + a->aw_bits != b->aw_bits || + a->dma_translation != b->dma_translation) + return false; + + switch (a->info.type) { + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI: + if (a->info.addr.pci.domain != b->info.addr.pci.domain || + a->info.addr.pci.bus != b->info.addr.pci.bus || + a->info.addr.pci.slot != b->info.addr.pci.slot || + a->info.addr.pci.function != b->info.addr.pci.function) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DRIVE: + if (a->info.addr.drive.controller != b->info.addr.drive.controller || + a->info.addr.drive.bus != b->info.addr.drive.bus || + a->info.addr.drive.unit != b->info.addr.drive.unit) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_SERIAL: + if (a->info.addr.vioserial.controller != b->info.addr.vioserial.controller || + a->info.addr.vioserial.bus != b->info.addr.vioserial.bus || + a->info.addr.vioserial.port != b->info.addr.vioserial.port) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCID: + if (a->info.addr.ccid.controller != b->info.addr.ccid.controller || + a->info.addr.ccid.slot != b->info.addr.ccid.slot) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_ISA: + if (a->info.addr.isa.iobase != b->info.addr.isa.iobase || + a->info.addr.isa.irq != b->info.addr.isa.irq) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_DIMM: + if (a->info.addr.dimm.slot != b->info.addr.dimm.slot) { + return false; + } + + if (a->info.addr.dimm.base != b->info.addr.dimm.base) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_CCW: + if (a->info.addr.ccw.cssid != b->info.addr.ccw.cssid || + a->info.addr.ccw.ssid != b->info.addr.ccw.ssid || + a->info.addr.ccw.devno != b->info.addr.ccw.devno) { + return false; + } + break; + + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_USB: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_SPAPRVIO: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_S390: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_VIRTIO_MMIO: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED: + case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_LAST: + break; + } + + if (a->info.acpiIndex != b->info.acpiIndex) { + return false; + } + + return true; +} + + +ssize_t +virDomainIOMMUDefFind(const virDomainDef *def, + const virDomainIOMMUDef *iommu) +{ + size_t i; + + for (i = 0; i < def->niommus; i++) { + if (virDomainIOMMUDefEquals(iommu, def->iommu[i])) + return i; + } + + return -1; +} + + bool virDomainVsockDefEquals(const virDomainVsockDef *a, const virDomainVsockDef *b) @@ -19375,6 +19481,7 @@ virDomainDefControllersParse(virDomainDef *def, return 0; } + static virDomainDef * virDomainDefParseXML(xmlXPathContextPtr ctxt, virDomainXMLOption *xmlopt, @@ -20021,19 +20128,28 @@ virDomainDefParseXML(xmlXPathContextPtr ctxt, } VIR_FREE(nodes); + /* analysis of iommu devices */ if ((n = virXPathNodeSet("./devices/iommu", ctxt, &nodes)) < 0) return NULL; - if (n > 1) { + if (n > 1 && !virXPathBoolean("./devices/iommu/@model = 'smmuv3Dev'", ctxt)) { virReportError(VIR_ERR_XML_ERROR, "%s", - _("only a single IOMMU device is supported")); + _("multiple IOMMU devices are only supported with model smmuv3Dev")); return NULL; } - if (n > 0) { - if (!(def->iommu = virDomainIOMMUDefParseXML(xmlopt, nodes[0], - ctxt, flags))) + if (n > 0) + def->iommu = g_new0(virDomainIOMMUDef *, n); + + for (i = 0; i < n; i++) { + virDomainIOMMUDef *iommu; + + iommu = virDomainIOMMUDefParseXML(xmlopt, nodes[i], ctxt, flags); + + if (!iommu) return NULL; + + def->iommu[def->niommus++] = iommu; } VIR_FREE(nodes); @@ -22455,15 +22571,17 @@ virDomainDefCheckABIStabilityFlags(virDomainDef *src, goto error; } - if (!!src->iommu != !!dst->iommu) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("Target domain IOMMU device count does not match source")); + if (src->niommus != dst->niommus) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target domain IOMMU device count %1$zu does not match source %2$zu"), + dst->niommus, src->niommus); goto error; } - if (src->iommu && - !virDomainIOMMUDefCheckABIStability(src->iommu, dst->iommu)) - goto error; + for (i = 0; i < src->niommus; i++) { + if (!virDomainIOMMUDefCheckABIStability(src->iommu[i], dst->iommu[i])) + goto error; + } if (!!src->vsock != !!dst->vsock) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", @@ -29222,8 +29340,9 @@ virDomainDefFormatInternalSetRootName(virDomainDef *def, for (n = 0; n < def->ncryptos; n++) { virDomainCryptoDefFormat(buf, def->cryptos[n], flags); } - if (def->iommu) - virDomainIOMMUDefFormat(buf, def->iommu); + + for (n = 0; n < def->niommus; n++) + virDomainIOMMUDefFormat(buf, def->iommu[n]); if (def->vsock) virDomainVsockDefFormat(buf, def->vsock); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 58b97a2b54..75568ff50a 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3005,6 +3005,7 @@ typedef enum { VIR_DOMAIN_IOMMU_MODEL_INTEL, VIR_DOMAIN_IOMMU_MODEL_SMMUV3, VIR_DOMAIN_IOMMU_MODEL_VIRTIO, + VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV, VIR_DOMAIN_IOMMU_MODEL_LAST } virDomainIOMMUModel; @@ -3258,6 +3259,9 @@ struct _virDomainDef { size_t nwatchdogs; virDomainWatchdogDef **watchdogs; + size_t niommus; + virDomainIOMMUDef **iommu; + /* At maximum 2 TPMs on the domain if a TPM Proxy is present. */ size_t ntpms; virDomainTPMDef **tpms; @@ -3267,7 +3271,6 @@ struct _virDomainDef { virDomainNVRAMDef *nvram; virCPUDef *cpu; virDomainRedirFilterDef *redirfilter; - virDomainIOMMUDef *iommu; virDomainVsockDef *vsock; virDomainPstoreDef *pstore; @@ -4269,6 +4272,12 @@ virDomainShmemDef *virDomainShmemDefRemove(virDomainDef *def, size_t idx) ssize_t virDomainInputDefFind(const virDomainDef *def, const virDomainInputDef *input) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; + +bool virDomainIOMMUDefEquals(const virDomainIOMMUDef *a, + const virDomainIOMMUDef *b); +ssize_t virDomainIOMMUDefFind(const virDomainDef *def, + const virDomainIOMMUDef *iommu) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; bool virDomainVsockDefEquals(const virDomainVsockDef *a, const virDomainVsockDef *b) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index d0d4bc0bf4..4861b1c002 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -1838,21 +1838,32 @@ virDomainDefCputuneValidate(const virDomainDef *def) static int virDomainDefIOMMUValidate(const virDomainDef *def) { + size_t i; + if (!def->iommu) - return 0; + return 0; - if (def->iommu->intremap == VIR_TRISTATE_SWITCH_ON && - def->features[VIR_DOMAIN_FEATURE_IOAPIC] != VIR_DOMAIN_IOAPIC_QEMU) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("IOMMU interrupt remapping requires split I/O APIC (ioapic driver='qemu')")); - return -1; - } + for (i = 0; i < def->niommus; i++) { + virDomainIOMMUDef *iommu = def->iommu[i]; - if (def->iommu->eim == VIR_TRISTATE_SWITCH_ON && - def->iommu->intremap != VIR_TRISTATE_SWITCH_ON) { - virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", - _("IOMMU eim requires interrupt remapping to be enabled")); - return -1; + if (def->niommus > 1 && iommu->model != VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("IOMMU model smmuv3Dev must be specified for multiple IOMMU definitions")); + } + + if (iommu->intremap == VIR_TRISTATE_SWITCH_ON && + def->features[VIR_DOMAIN_FEATURE_IOAPIC] != VIR_DOMAIN_IOAPIC_QEMU) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU interrupt remapping requires split I/O APIC (ioapic driver='qemu')")); + return -1; + } + + if (iommu->eim == VIR_TRISTATE_SWITCH_ON && + iommu->intremap != VIR_TRISTATE_SWITCH_ON) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU eim requires interrupt remapping to be enabled")); + return -1; + } } return 0; @@ -3066,6 +3077,7 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) { switch (iommu->model) { case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: if (iommu->intremap != VIR_TRISTATE_SWITCH_ABSENT || iommu->caching_mode != VIR_TRISTATE_SWITCH_ABSENT || @@ -3095,6 +3107,8 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) return -1; } break; + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + break; case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: case VIR_DOMAIN_IOMMU_MODEL_LAST: diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index 5597d5a66b..8d1768b24f 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6190,6 +6190,7 @@ <value>intel</value> <value>smmuv3</value> <value>virtio</value> + <value>smmuv3Dev</value> </choice> </attribute> <interleave> @@ -6233,9 +6234,6 @@ <optional> <ref name="address"/> </optional> - <optional> - <ref name="alias"/> - </optional> </interleave> </element> </define> @@ -6867,9 +6865,9 @@ <zeroOrMore> <ref name="panic"/> </zeroOrMore> - <optional> + <zeroOrMore> <ref name="iommu"/> - </optional> + </zeroOrMore> <optional> <ref name="vsock"/> </optional> diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index a8ebf9efd8..143d3af8f3 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -491,6 +491,8 @@ virDomainInputSourceGrabToggleTypeToString; virDomainInputSourceGrabTypeFromString; virDomainInputSourceGrabTypeToString; virDomainInputTypeToString; +virDomainIOMMUDefEquals; +virDomainIOMMUDefFind; virDomainIOMMUDefFree; virDomainIOMMUDefNew; virDomainIOMMUModelTypeFromString; diff --git a/src/qemu/qemu_alias.c b/src/qemu/qemu_alias.c index 3e6bced4a8..d635eb0178 100644 --- a/src/qemu/qemu_alias.c +++ b/src/qemu/qemu_alias.c @@ -631,10 +631,14 @@ qemuAssignDeviceVsockAlias(virDomainVsockDef *vsock) static void -qemuAssignDeviceIOMMUAlias(virDomainIOMMUDef *iommu) +qemuAssignDeviceIOMMUAlias(virDomainDef *def, + virDomainIOMMUDef **iommu) { - if (!iommu->info.alias) - iommu->info.alias = g_strdup("iommu0"); + size_t i; + for (i = 0; i < def->niommus; i++) { + if (!iommu[i]->info.alias) + iommu[i]->info.alias = g_strdup_printf("iommu%zu", i); + } } @@ -750,8 +754,9 @@ qemuAssignDeviceAliases(virDomainDef *def) if (def->vsock) { qemuAssignDeviceVsockAlias(def->vsock); } - if (def->iommu) - qemuAssignDeviceIOMMUAlias(def->iommu); + if (def->iommu && def->niommus > 0) { + qemuAssignDeviceIOMMUAlias(def, def->iommu); + } for (i = 0; i < def->ncryptos; i++) { qemuAssignDeviceCryptoAlias(def, def->cryptos[i]); } diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index bf03561fde..9deafefaad 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6128,60 +6128,106 @@ qemuBuildBootCommandLine(virCommand *cmd, } +static virJSONValue * +qemuBuildPCISmmuv3DevDevProps(const virDomainDef *def, + virDomainIOMMUDef *iommu) +{ + g_autoptr(virJSONValue) props = NULL; + g_autofree char *pciaddr = NULL; + g_autofree char *bus = qemuBuildDeviceAddressPCIGetBus(def, &iommu->info); + + if (!bus) + return NULL; + + if (iommu->info.addr.pci.function != 0) { + pciaddr = g_strdup_printf("0x%x.0x%x", iommu->info.addr.pci.slot, + iommu->info.addr.pci.function); + } else { + pciaddr = g_strdup_printf("0x%x", iommu->info.addr.pci.slot); + } + if (virJSONValueObjectAdd(&props, + "s:driver", "arm-smmuv3-accel", + "s:bus", bus, + NULL) < 0) + return NULL; + + return g_steal_pointer(&props); +} + + static int qemuBuildIOMMUCommandLine(virCommand *cmd, const virDomainDef *def, virQEMUCaps *qemuCaps) { + size_t i; g_autoptr(virJSONValue) props = NULL; - const virDomainIOMMUDef *iommu = def->iommu; - if (!iommu) + if (!def->iommu || def->niommus <= 0) return 0; - switch (iommu->model) { - case VIR_DOMAIN_IOMMU_MODEL_INTEL: - if (virJSONValueObjectAdd(&props, - "s:driver", "intel-iommu", - "s:id", iommu->info.alias, - "S:intremap", qemuOnOffAuto(iommu->intremap), - "T:caching-mode", iommu->caching_mode, - "S:eim", qemuOnOffAuto(iommu->eim), - "T:device-iotlb", iommu->iotlb, - "z:aw-bits", iommu->aw_bits, - "T:dma-translation", iommu->dma_translation, - NULL) < 0) - return -1; + for (i = 0; i < def->niommus; i++) { + virDomainIOMMUDef *iommu = def->iommu[i]; + switch (iommu->model) { + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + /* Ignore unassigned devices */ + if (iommu->info.type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) + continue; - if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) - return -1; + if (qemuCommandAddExtDevice(cmd, &iommu->info, def, qemuCaps) < 0) + return -1; - return 0; + if (!(props = qemuBuildPCISmmuv3DevDevProps(def, iommu))) + return -1; - case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: - if (virJSONValueObjectAdd(&props, - "s:driver", "virtio-iommu", - "s:id", iommu->info.alias, - NULL) < 0) { - return -1; - } + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; - if (qemuBuildDeviceAddressProps(props, def, &iommu->info) < 0) - return -1; + break; - if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) - return -1; + case VIR_DOMAIN_IOMMU_MODEL_INTEL: + if (virJSONValueObjectAdd(&props, + "s:driver", "intel-iommu", + "s:id", iommu->info.alias, + "S:intremap", qemuOnOffAuto(iommu->intremap), + "T:caching-mode", iommu->caching_mode, + "S:eim", qemuOnOffAuto(iommu->eim), + "T:device-iotlb", iommu->iotlb, + "z:aw-bits", iommu->aw_bits, + "T:dma-translation", iommu->dma_translation, + NULL) < 0) + return -1; - return 0; + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; - case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: - /* There is no -device for SMMUv3, so nothing to be done here */ - return 0; + return 0; - case VIR_DOMAIN_IOMMU_MODEL_LAST: - default: - virReportEnumRangeError(virDomainIOMMUModel, iommu->model); - return -1; + case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: + if (virJSONValueObjectAdd(&props, + "s:driver", "virtio-iommu", + "s:id", iommu->info.alias, + NULL) < 0) { + return -1; + } + + if (qemuBuildDeviceAddressProps(props, def, &iommu->info) < 0) + return -1; + + if (qemuBuildDeviceCommandlineFromJSON(cmd, props, def, qemuCaps) < 0) + return -1; + + return 0; + + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + /* There is no -device for SMMUv3, so nothing to be done here */ + return 0; + + case VIR_DOMAIN_IOMMU_MODEL_LAST: + default: + virReportEnumRangeError(virDomainIOMMUModel, iommu->model); + return -1; + } } return 0; @@ -7001,8 +7047,8 @@ qemuBuildMachineCommandLine(virCommand *cmd, if (qemuAppendDomainFeaturesMachineParam(&buf, def, qemuCaps) < 0) return -1; - if (def->iommu) { - switch (def->iommu->model) { + if (def->iommu && def->niommus == 1) { + switch (def->iommu[0]->model) { case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: virBufferAddLit(&buf, ",iommu=smmuv3"); break; @@ -7011,10 +7057,11 @@ qemuBuildMachineCommandLine(virCommand *cmd, case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: /* These IOMMUs are formatted in qemuBuildIOMMUCommandLine */ break; - + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + break; case VIR_DOMAIN_IOMMU_MODEL_LAST: default: - virReportEnumRangeError(virDomainIOMMUModel, def->iommu->model); + virReportEnumRangeError(virDomainIOMMUModel, def->iommu[0]->model); return -1; } } @@ -10598,15 +10645,15 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildBootCommandLine(cmd, def) < 0) return NULL; - if (qemuBuildIOMMUCommandLine(cmd, def, qemuCaps) < 0) - return NULL; - if (qemuBuildGlobalControllerCommandLine(cmd, def) < 0) return NULL; if (qemuBuildControllersCommandLine(cmd, def, qemuCaps) < 0) return NULL; + if (qemuBuildIOMMUCommandLine(cmd, def, qemuCaps) < 0) + return NULL; + if (qemuBuildMemoryDeviceCommandLine(cmd, cfg, def, priv) < 0) return NULL; diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index e89cdee487..aeacdbc9db 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -943,6 +943,8 @@ qemuDomainDeviceCalculatePCIConnectFlags(virDomainDeviceDef *dev, case VIR_DOMAIN_IOMMU_MODEL_INTEL: case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + return VIR_PCI_CONNECT_TYPE_SMMUV3_DEV; case VIR_DOMAIN_IOMMU_MODEL_LAST: /* These are not PCI devices */ return 0; @@ -2354,22 +2356,25 @@ qemuDomainAssignDevicePCISlots(virDomainDef *def, /* Nada - none are PCI based (yet) */ } - if (def->iommu) { - virDomainIOMMUDef *iommu = def->iommu; + if (def->iommu && def->niommus > 0) { + for (i = 0; i < def->niommus; i++) { + virDomainIOMMUDef *iommu = def->iommu[i]; - switch (iommu->model) { - case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: - if (virDeviceInfoPCIAddressIsWanted(&iommu->info) && - qemuDomainPCIAddressReserveNextAddr(addrs, &iommu->info) < 0) { - return -1; - } - break; + switch (iommu->model) { + case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: + if (virDeviceInfoPCIAddressIsWanted(&iommu->info) && + qemuDomainPCIAddressReserveNextAddr(addrs, &iommu->info) < 0) { + return -1; + } + break; - case VIR_DOMAIN_IOMMU_MODEL_INTEL: - case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: - case VIR_DOMAIN_IOMMU_MODEL_LAST: - /* These are not PCI devices */ - break; + case VIR_DOMAIN_IOMMU_MODEL_INTEL: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + case VIR_DOMAIN_IOMMU_MODEL_LAST: + /* These are not PCI devices */ + break; + } } } diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index a34d6f1437..8599761137 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -6894,12 +6894,12 @@ qemuDomainAttachDeviceConfig(virDomainDef *vmdef, break; case VIR_DOMAIN_DEVICE_IOMMU: - if (vmdef->iommu) { + if (vmdef->iommu && vmdef->niommus > 0) { virReportError(VIR_ERR_OPERATION_INVALID, "%s", _("domain already has an iommu device")); return -1; } - vmdef->iommu = g_steal_pointer(&dev->data.iommu); + VIR_APPEND_ELEMENT(vmdef->iommu, vmdef->niommus, dev->data.iommu); break; case VIR_DOMAIN_DEVICE_VIDEO: @@ -7113,12 +7113,12 @@ qemuDomainDetachDeviceConfig(virDomainDef *vmdef, break; case VIR_DOMAIN_DEVICE_IOMMU: - if (!vmdef->iommu) { + if ((idx = virDomainIOMMUDefFind(vmdef, dev->data.iommu)) < 0) { virReportError(VIR_ERR_OPERATION_FAILED, "%s", _("matching iommu device not found")); return -1; } - g_clear_pointer(&vmdef->iommu, virDomainIOMMUDefFree); + VIR_DELETE_ELEMENT(vmdef->iommu, idx, vmdef->niommus); break; case VIR_DOMAIN_DEVICE_VIDEO: diff --git a/src/qemu/qemu_postparse.c b/src/qemu/qemu_postparse.c index ed4af9ca8e..f2df3cba73 100644 --- a/src/qemu/qemu_postparse.c +++ b/src/qemu/qemu_postparse.c @@ -1450,7 +1450,7 @@ qemuDomainDefAddDefaultDevices(virQEMUDriver *driver, } } - if (addIOMMU && !def->iommu && + if (addIOMMU && !def->iommu && def->niommus == 0 && virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE_INTEL_IOMMU) && virQEMUCapsGet(qemuCaps, QEMU_CAPS_INTEL_IOMMU_INTREMAP) && virQEMUCapsGet(qemuCaps, QEMU_CAPS_INTEL_IOMMU_EIM)) { @@ -1462,7 +1462,8 @@ qemuDomainDefAddDefaultDevices(virQEMUDriver *driver, iommu->intremap = VIR_TRISTATE_SWITCH_ON; iommu->eim = VIR_TRISTATE_SWITCH_ON; - def->iommu = g_steal_pointer(&iommu); + def->iommu = g_new0(virDomainIOMMUDef *, 1); + def->iommu[def->niommus++] = g_steal_pointer(&iommu); } if (qemuDomainDefAddDefaultAudioBackend(driver, def) < 0) @@ -1538,9 +1539,9 @@ qemuDomainDefEnableDefaultFeatures(virDomainDef *def, * domain already has IOMMU without inremap. This will be fixed in * qemuDomainIOMMUDefPostParse() but there domain definition can't be * modified so change it now. */ - if (def->iommu && - (def->iommu->intremap == VIR_TRISTATE_SWITCH_ON || - qemuDomainNeedsIOMMUWithEIM(def)) && + if (def->iommu && def->niommus == 1 && + (def->iommu[0]->intremap == VIR_TRISTATE_SWITCH_ON || + qemuDomainNeedsIOMMUWithEIM(def)) && def->features[VIR_DOMAIN_FEATURE_IOAPIC] == VIR_DOMAIN_IOAPIC_NONE) { def->features[VIR_DOMAIN_FEATURE_IOAPIC] = VIR_DOMAIN_IOAPIC_QEMU; } diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c index b2c3c9e2f6..f4431f1b2a 100644 --- a/src/qemu/qemu_validate.c +++ b/src/qemu/qemu_validate.c @@ -851,7 +851,12 @@ qemuValidateDomainVCpuTopology(const virDomainDef *def, virQEMUCaps *qemuCaps) QEMU_MAX_VCPUS_WITHOUT_EIM); return -1; } - if (!def->iommu || def->iommu->eim != VIR_TRISTATE_SWITCH_ON) { + if (!def->iommu || def->niommus == 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("iommu device not found. more than %1$d vCPUs require extended interrupt mode enabled on the iommu device"), + QEMU_MAX_VCPUS_WITHOUT_EIM); + return -1; + } else if (def->iommu[0]->eim != VIR_TRISTATE_SWITCH_ON) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, _("more than %1$d vCPUs require extended interrupt mode enabled on the iommu device"), QEMU_MAX_VCPUS_WITHOUT_EIM); @@ -5225,6 +5230,21 @@ qemuValidateDomainDeviceDefIOMMU(const virDomainIOMMUDef *iommu, } break; + case VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV: + if (!qemuDomainIsARMVirt(def)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("IOMMU device: '%1$s' is only supported with ARM Virt machines"), + virDomainIOMMUModelTypeToString(iommu->model)); + return -1; + } + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_MACHINE_VIRT_IOMMU)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("IOMMU device: '%1$s' is not supported with this QEMU binary"), + virDomainIOMMUModelTypeToString(iommu->model)); + return -1; + } + break; + case VIR_DOMAIN_IOMMU_MODEL_LAST: default: virReportEnumRangeError(virDomainIOMMUModel, iommu->model); -- 2.43.0

On Thu, May 15, 2025 at 01:36:39PM -0700, Nathan Chen via Devel wrote:
Add support for "smmuv3Dev" IOMMU model, and add support for parsing multiple IOMMU devices from the VM definition when "smmuv3Dev" is the IOMMU model. Enable plugging smmuv3Dev into pcie-root and pcie-expander-bus.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 5 +- src/conf/domain_addr.c | 12 ++- src/conf/domain_addr.h | 4 +- src/conf/domain_conf.c | 153 ++++++++++++++++++++++++++---- src/conf/domain_conf.h | 11 ++- src/conf/domain_validate.c | 38 +++++--- src/conf/schemas/domaincommon.rng | 8 +- src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 15 ++- src/qemu/qemu_command.c | 135 +++++++++++++++++--------- src/qemu/qemu_domain_address.c | 33 ++++--- src/qemu/qemu_driver.c | 8 +- src/qemu/qemu_postparse.c | 11 ++- src/qemu/qemu_validate.c | 22 ++++- 14 files changed, 342 insertions(+), 115 deletions(-)
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 58b97a2b54..75568ff50a 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3005,6 +3005,7 @@ typedef enum { VIR_DOMAIN_IOMMU_MODEL_INTEL, VIR_DOMAIN_IOMMU_MODEL_SMMUV3, VIR_DOMAIN_IOMMU_MODEL_VIRTIO, + VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV,
Can you remove this addition from this patch and put it in a following commit...
VIR_DOMAIN_IOMMU_MODEL_LAST } virDomainIOMMUModel; @@ -3258,6 +3259,9 @@ struct _virDomainDef { size_t nwatchdogs; virDomainWatchdogDef **watchdogs;
+ size_t niommus; + virDomainIOMMUDef **iommu;
...so we focus exclusively on this bit.
+ /* At maximum 2 TPMs on the domain if a TPM Proxy is present. */ size_t ntpms; virDomainTPMDef **tpms; @@ -3267,7 +3271,6 @@ struct _virDomainDef { virDomainNVRAMDef *nvram; virCPUDef *cpu; virDomainRedirFilterDef *redirfilter; - virDomainIOMMUDef *iommu; virDomainVsockDef *vsock; virDomainPstoreDef *pstore;
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 5/20/2025 4:20 AM, Daniel P. Berrangé wrote:
Add support for "smmuv3Dev" IOMMU model, and add support for parsing multiple IOMMU devices from the VM definition when "smmuv3Dev" is the IOMMU model. Enable plugging smmuv3Dev into pcie-root and pcie-expander-bus.
Signed-off-by: Nathan Chen<nathanc@nvidia.com> --- docs/formatdomain.rst | 5 +- src/conf/domain_addr.c | 12 ++- src/conf/domain_addr.h | 4 +- src/conf/domain_conf.c | 153 ++++++++++++++++++++++++++---- src/conf/domain_conf.h | 11 ++- src/conf/domain_validate.c | 38 +++++--- src/conf/schemas/domaincommon.rng | 8 +- src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 15 ++- src/qemu/qemu_command.c | 135 +++++++++++++++++--------- src/qemu/qemu_domain_address.c | 33 ++++--- src/qemu/qemu_driver.c | 8 +- src/qemu/qemu_postparse.c | 11 ++- src/qemu/qemu_validate.c | 22 ++++- 14 files changed, 342 insertions(+), 115 deletions(-)
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 58b97a2b54..75568ff50a 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3005,6 +3005,7 @@ typedef enum { VIR_DOMAIN_IOMMU_MODEL_INTEL, VIR_DOMAIN_IOMMU_MODEL_SMMUV3, VIR_DOMAIN_IOMMU_MODEL_VIRTIO, + VIR_DOMAIN_IOMMU_MODEL_SMMUV3_DEV, Can you remove this addition from this patch and put it in a following commit...
VIR_DOMAIN_IOMMU_MODEL_LAST } virDomainIOMMUModel; @@ -3258,6 +3259,9 @@ struct _virDomainDef { size_t nwatchdogs; virDomainWatchdogDef **watchdogs;
+ size_t niommus; + virDomainIOMMUDef **iommu;
...so we focus exclusively on this bit.
Yes, I will separate these changes out in the next revision so that we have separate commits - one for multiple IOMMU definitions and one for smmuv3Dev support.
+ /* At maximum 2 TPMs on the domain if a TPM Proxy is present. */ size_t ntpms; virDomainTPMDef **tpms; @@ -3267,7 +3271,6 @@ struct _virDomainDef { virDomainNVRAMDef *nvram; virCPUDef *cpu; virDomainRedirFilterDef *redirfilter; - virDomainIOMMUDef *iommu; virDomainVsockDef *vsock; virDomainPstoreDef *pstore;
With regards, Daniel
Thanks, Nathan

Add support for an "iommufd" virDomainIOMMUDef member that will be translated to a qemu "-object iommufd" command line argument. This iommufd struct has an "id" member to specify the iommufd object id that can be associated with a passthrough device, and an "fd" member to specify the fd for externally opening the /dev/iommu and VFIO cdev by a management layer. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.c | 108 +++++++++++++++++++++++++++++- src/conf/domain_conf.h | 8 +++ src/conf/domain_validate.c | 44 +++++++++++- src/conf/schemas/domaincommon.rng | 14 ++++ src/conf/virconftypes.h | 2 + src/qemu/qemu_command.c | 13 ++++ 6 files changed, 186 insertions(+), 3 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index c8d481eb5c..9330c47b7a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2814,6 +2814,9 @@ virDomainIOMMUDefFree(virDomainIOMMUDef *iommu) if (!iommu) return; + if (iommu->iommufd) + virDomainIommufdDefFree(iommu->iommufd); + virDomainDeviceInfoClear(&iommu->info); g_free(iommu); } @@ -14292,6 +14295,54 @@ virDomainMemoryDefParseXML(virDomainXMLOption *xmlopt, } +void virDomainIommufdDefFree(virDomainIommufdDef *def) +{ + if (!def) + return; + + g_free(def->id); + + g_free(def->fd); + + g_free(def); +} + + +static virDomainIommufdDef * +virDomainIommufdParseXML(xmlNodePtr node, + xmlXPathContextPtr ctxt) +{ + virDomainIommufdDef *def; + g_autofree char *iommufdId = NULL; + g_autofree char *iommufdFd = NULL; + VIR_XPATH_NODE_AUTORESTORE(ctxt) + + ctxt->node = node; + + def = g_new0(virDomainIommufdDef, 1); + + if (!(iommufdId = virXPathString("string(./id)", ctxt))) + goto error; + + def->id = g_strdup(iommufdId); + + iommufdFd = virXPathString("string(./fd)", ctxt); + + def->fd = g_strdup(iommufdFd); + + if (!def->id) + goto error; + + if (iommufdFd && !def->fd) + goto error; + + return def; + error: + virDomainIommufdDefFree(def); + return NULL; +} + + static virDomainIOMMUDef * virDomainIOMMUDefParseXML(virDomainXMLOption *xmlopt, xmlNodePtr node, @@ -14299,7 +14350,7 @@ virDomainIOMMUDefParseXML(virDomainXMLOption *xmlopt, unsigned int flags) { VIR_XPATH_NODE_AUTORESTORE(ctxt) - xmlNodePtr driver; + xmlNodePtr driver, iommufd; g_autoptr(virDomainIOMMUDef) iommu = NULL; ctxt->node = node; @@ -14336,6 +14387,11 @@ virDomainIOMMUDefParseXML(virDomainXMLOption *xmlopt, return NULL; } + if ((iommufd = virXPathNode("./iommufd", ctxt))) { + if (!(iommu->iommufd = virDomainIommufdParseXML(iommufd, ctxt))) + return NULL; + } + if (virDomainDeviceInfoParseXML(xmlopt, node, ctxt, &iommu->info, flags) < 0) return NULL; @@ -16384,6 +16440,25 @@ virDomainIOMMUDefEquals(const virDomainIOMMUDef *a, a->dma_translation != b->dma_translation) return false; + if (a->iommufd && b->iommufd) { + if (a->iommufd->id && b->iommufd->id) { + if (STRNEQ(a->iommufd->id, b->iommufd->id)) + return false; + } else if ((a->iommufd->id && !b->iommufd->id) || + (!a->iommufd->id && b->iommufd->id)) { + return false; + } + if (a->iommufd->fd && b->iommufd->fd) { + if (STRNEQ(a->iommufd->fd, b->iommufd->fd)) + return false; + } else if ((a->iommufd->fd && !b->iommufd->fd) || + (!a->iommufd->fd && b->iommufd->fd)) { + return false; + } + } else { + return false; + } + switch (a->info.type) { case VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI: if (a->info.addr.pci.domain != b->info.addr.pci.domain || @@ -22078,6 +22153,27 @@ virDomainIOMMUDefCheckABIStability(virDomainIOMMUDef *src, virTristateSwitchTypeToString(src->dma_translation)); return false; } + if (src->iommufd && dst->iommufd) { + if (STRNEQ(src->iommufd->id, dst->iommufd->id)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target domain IOMMU device iommufd id '%1$s' does not match source '%2$s'"), + dst->iommufd->id, + src->iommufd->id); + return false; + } + if (STRNEQ(src->iommufd->fd, dst->iommufd->fd)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target domain IOMMU device iommufd fd '%1$s' does not match source '%2$s'"), + dst->iommufd->fd, + src->iommufd->fd); + return false; + } + } else if ((src->iommufd && !dst->iommufd) || + (!src->iommufd && dst->iommufd)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Target domain IOMMU device iommufd does not match source")); + return false; + } return virDomainDeviceInfoCheckABIStability(&src->info, &dst->info); } @@ -28310,6 +28406,16 @@ virDomainIOMMUDefFormat(virBuffer *buf, virXMLFormatElement(&childBuf, "driver", &driverAttrBuf, NULL); + if (iommu->iommufd) { + virBufferAddLit(&childBuf, "<iommufd>\n"); + virBufferAdjustIndent(&childBuf, 2); + virBufferAsprintf(&childBuf, "<id>%s</id>\n", iommu->iommufd->id); + if (iommu->iommufd->fd) + virBufferAsprintf(&childBuf, "<fd>%s</fd>\n", iommu->iommufd->fd); + virBufferAdjustIndent(&childBuf, -2); + virBufferAddLit(&childBuf, "</iommufd>\n"); + } + virDomainDeviceInfoFormat(&childBuf, &iommu->info, 0); virBufferAsprintf(&attrBuf, " model='%s'", diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 75568ff50a..ce447e9823 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3010,6 +3010,11 @@ typedef enum { VIR_DOMAIN_IOMMU_MODEL_LAST } virDomainIOMMUModel; +struct _virDomainIommufdDef { + char *id; + char *fd; +}; + struct _virDomainIOMMUDef { virDomainIOMMUModel model; virTristateSwitch intremap; @@ -3017,6 +3022,8 @@ struct _virDomainIOMMUDef { virTristateSwitch eim; virTristateSwitch iotlb; unsigned int aw_bits; + virDomainIommufdDef *iommufd; + virDomainDeviceInfo info; virTristateSwitch dma_translation; }; @@ -3747,6 +3754,7 @@ virDomainVideoDef *virDomainVideoDefNew(virDomainXMLOption *xmlopt); void virDomainVideoDefFree(virDomainVideoDef *def); G_DEFINE_AUTOPTR_CLEANUP_FUNC(virDomainVideoDef, virDomainVideoDefFree); void virDomainVideoDefClear(virDomainVideoDef *def); +void virDomainIommufdDefFree(virDomainIommufdDef *def); virDomainHostdevDef *virDomainHostdevDefNew(void); void virDomainHostdevDefFree(virDomainHostdevDef *def); void virDomainHubDefFree(virDomainHubDef *def); diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index 4861b1c002..88649ba0b9 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -1838,9 +1838,9 @@ virDomainDefCputuneValidate(const virDomainDef *def) static int virDomainDefIOMMUValidate(const virDomainDef *def) { - size_t i; + size_t i, j; - if (!def->iommu) + if (!def->iommu || def->niommus == 0) return 0; for (i = 0; i < def->niommus; i++) { @@ -1851,6 +1851,39 @@ virDomainDefIOMMUValidate(const virDomainDef *def) _("IOMMU model smmuv3Dev must be specified for multiple IOMMU definitions")); } + for (j = i + 1; j < def->niommus; j++) { + if (virDomainIOMMUDefEquals(iommu, + def->iommu[j])) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("IOMMU already exists in the domain configuration")); + return -1; + } + + if (iommu->iommufd && def->iommu[j]->iommufd) { + if (iommu->iommufd->id && def->iommu[j]->iommufd->id) { + if (STRNEQ(iommu->iommufd->id, def->iommu[j]->iommufd->id)) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("iommufd ID must be the same for multiple IOMMUs")); + return -1; + } + } else { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("iommufd ID must be specified")); + } + if (iommu->iommufd->fd && def->iommu[j]->iommufd->fd && + STRNEQ(iommu->iommufd->fd, def->iommu[j]->iommufd->fd)) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("iommufd FD must be the same for multiple IOMMUs")); + return -1; + } + } else if ((iommu->iommufd && !def->iommu[j]->iommufd) || + (!iommu->iommufd && def->iommu[j]->iommufd)) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("The same iommufd configuration must be specified for multiple IOMMUs")); + return -1; + } + } + if (iommu->intremap == VIR_TRISTATE_SWITCH_ON && def->features[VIR_DOMAIN_FEATURE_IOAPIC] != VIR_DOMAIN_IOAPIC_QEMU) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", @@ -3115,6 +3148,13 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) break; } + if (iommu->iommufd) { + if (!iommu->iommufd->id) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("iommufd must have an associated id")); + return -1; + } + } return 0; } diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index 8d1768b24f..41db338c71 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6231,6 +6231,20 @@ <optional> <ref name="acpi"/> </optional> + <optional> + <element name="iommufd"> + <interleave> + <element name="id"> + <text/> + </element> + <optional> + <element name="fd"> + <text/> + </element> + </optional> + </interleave> + </element> + </optional> <optional> <ref name="address"/> </optional> diff --git a/src/conf/virconftypes.h b/src/conf/virconftypes.h index c70437bc05..47392154a4 100644 --- a/src/conf/virconftypes.h +++ b/src/conf/virconftypes.h @@ -142,6 +142,8 @@ typedef struct _virDomainHubDef virDomainHubDef; typedef struct _virDomainHugePage virDomainHugePage; +typedef struct _virDomainIommufdDef virDomainIommufdDef; + typedef struct _virDomainIOMMUDef virDomainIOMMUDef; typedef struct _virDomainIOThreadIDDef virDomainIOThreadIDDef; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 9deafefaad..d683d0eef7 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6166,6 +6166,19 @@ qemuBuildIOMMUCommandLine(virCommand *cmd, if (!def->iommu || def->niommus <= 0) return 0; + if (def->iommu[0]->iommufd) { + if (qemuMonitorCreateObjectProps(&props, "iommufd", + def->iommu[0]->iommufd->id, + "S:fd", def->iommu[0]->iommufd->fd, + NULL) < 0) + return -1; + + if (qemuBuildObjectCommandlineFromJSON(cmd, props) < 0) + return -1; + } + + props = NULL; + for (i = 0; i < def->niommus; i++) { virDomainIOMMUDef *iommu = def->iommu[i]; switch (iommu->model) { -- 2.43.0

On Thu, May 15, 2025 at 01:36:40PM -0700, Nathan Chen via Devel wrote:
Add support for an "iommufd" virDomainIOMMUDef member that will be translated to a qemu "-object iommufd" command line argument. This iommufd struct has an "id" member to specify the iommufd object id that can be associated with a passthrough device, and an "fd" member to specify the fd for externally opening the /dev/iommu and VFIO cdev by a management layer.
@@ -28310,6 +28406,16 @@ virDomainIOMMUDefFormat(virBuffer *buf,
virXMLFormatElement(&childBuf, "driver", &driverAttrBuf, NULL);
+ if (iommu->iommufd) { + virBufferAddLit(&childBuf, "<iommufd>\n"); + virBufferAdjustIndent(&childBuf, 2); + virBufferAsprintf(&childBuf, "<id>%s</id>\n", iommu->iommufd->id); + if (iommu->iommufd->fd) + virBufferAsprintf(&childBuf, "<fd>%s</fd>\n", iommu->iommufd->fd);
I'm not convinced we should be exposing the QEMU "id" in the XML. It is an identifier libvirt uses internally for configuring QEMU, but I don't see a reason why the end user needs to control its name. I'm similarly not sure we should expose an FD to users eitherm, just use that internally when talking to QEMU. IOW, why aren't we just having "iommu='yes'" as an attribute on the <iommu> device, avoiding all the logic below that tries to force all devices to use the same configuration.
+ virBufferAdjustIndent(&childBuf, -2); + virBufferAddLit(&childBuf, "</iommufd>\n"); + } + virDomainDeviceInfoFormat(&childBuf, &iommu->info, 0);
virBufferAsprintf(&attrBuf, " model='%s'", diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 75568ff50a..ce447e9823 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3010,6 +3010,11 @@ typedef enum { VIR_DOMAIN_IOMMU_MODEL_LAST } virDomainIOMMUModel;
+struct _virDomainIommufdDef { + char *id; + char *fd; +}; + struct _virDomainIOMMUDef { virDomainIOMMUModel model; virTristateSwitch intremap; @@ -3017,6 +3022,8 @@ struct _virDomainIOMMUDef { virTristateSwitch eim; virTristateSwitch iotlb; unsigned int aw_bits; + virDomainIommufdDef *iommufd; + virDomainDeviceInfo info; virTristateSwitch dma_translation; }; @@ -3747,6 +3754,7 @@ virDomainVideoDef *virDomainVideoDefNew(virDomainXMLOption *xmlopt); void virDomainVideoDefFree(virDomainVideoDef *def); G_DEFINE_AUTOPTR_CLEANUP_FUNC(virDomainVideoDef, virDomainVideoDefFree); void virDomainVideoDefClear(virDomainVideoDef *def); +void virDomainIommufdDefFree(virDomainIommufdDef *def); virDomainHostdevDef *virDomainHostdevDefNew(void); void virDomainHostdevDefFree(virDomainHostdevDef *def); void virDomainHubDefFree(virDomainHubDef *def); diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index 4861b1c002..88649ba0b9 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -1838,9 +1838,9 @@ virDomainDefCputuneValidate(const virDomainDef *def) static int virDomainDefIOMMUValidate(const virDomainDef *def) { - size_t i; + size_t i, j;
- if (!def->iommu) + if (!def->iommu || def->niommus == 0) return 0;
for (i = 0; i < def->niommus; i++) { @@ -1851,6 +1851,39 @@ virDomainDefIOMMUValidate(const virDomainDef *def) _("IOMMU model smmuv3Dev must be specified for multiple IOMMU definitions")); }
+ for (j = i + 1; j < def->niommus; j++) { + if (virDomainIOMMUDefEquals(iommu, + def->iommu[j])) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("IOMMU already exists in the domain configuration")); + return -1; + } + + if (iommu->iommufd && def->iommu[j]->iommufd) { + if (iommu->iommufd->id && def->iommu[j]->iommufd->id) { + if (STRNEQ(iommu->iommufd->id, def->iommu[j]->iommufd->id)) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("iommufd ID must be the same for multiple IOMMUs")); + return -1; + } + } else { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("iommufd ID must be specified")); + } + if (iommu->iommufd->fd && def->iommu[j]->iommufd->fd && + STRNEQ(iommu->iommufd->fd, def->iommu[j]->iommufd->fd)) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("iommufd FD must be the same for multiple IOMMUs")); + return -1; + } + } else if ((iommu->iommufd && !def->iommu[j]->iommufd) || + (!iommu->iommufd && def->iommu[j]->iommufd)) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("The same iommufd configuration must be specified for multiple IOMMUs")); + return -1; + } + } + if (iommu->intremap == VIR_TRISTATE_SWITCH_ON && def->features[VIR_DOMAIN_FEATURE_IOAPIC] != VIR_DOMAIN_IOAPIC_QEMU) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", @@ -3115,6 +3148,13 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) break; }
+ if (iommu->iommufd) { + if (!iommu->iommufd->id) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("iommufd must have an associated id")); + return -1; + } + } return 0; }
diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index 8d1768b24f..41db338c71 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6231,6 +6231,20 @@ <optional> <ref name="acpi"/> </optional> + <optional> + <element name="iommufd"> + <interleave> + <element name="id"> + <text/> + </element> + <optional> + <element name="fd"> + <text/> + </element> + </optional> + </interleave> + </element> + </optional> <optional> <ref name="address"/> </optional> diff --git a/src/conf/virconftypes.h b/src/conf/virconftypes.h index c70437bc05..47392154a4 100644 --- a/src/conf/virconftypes.h +++ b/src/conf/virconftypes.h @@ -142,6 +142,8 @@ typedef struct _virDomainHubDef virDomainHubDef;
typedef struct _virDomainHugePage virDomainHugePage;
+typedef struct _virDomainIommufdDef virDomainIommufdDef; + typedef struct _virDomainIOMMUDef virDomainIOMMUDef;
typedef struct _virDomainIOThreadIDDef virDomainIOThreadIDDef; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 9deafefaad..d683d0eef7 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6166,6 +6166,19 @@ qemuBuildIOMMUCommandLine(virCommand *cmd, if (!def->iommu || def->niommus <= 0) return 0;
+ if (def->iommu[0]->iommufd) { + if (qemuMonitorCreateObjectProps(&props, "iommufd", + def->iommu[0]->iommufd->id, + "S:fd", def->iommu[0]->iommufd->fd, + NULL) < 0) + return -1; + + if (qemuBuildObjectCommandlineFromJSON(cmd, props) < 0) + return -1; + } + + props = NULL; + for (i = 0; i < def->niommus; i++) { virDomainIOMMUDef *iommu = def->iommu[i]; switch (iommu->model) { -- 2.43.0
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 5/20/2025 6:01 AM, Daniel P. Berrangé wrote:
Add support for an "iommufd" virDomainIOMMUDef member that will be translated to a qemu "-object iommufd" command line argument. This iommufd struct has an "id" member to specify the iommufd object id that can be associated with a passthrough device, and an "fd" member to specify the fd for externally opening the /dev/iommu and VFIO cdev by a management layer.
@@ -28310,6 +28406,16 @@ virDomainIOMMUDefFormat(virBuffer *buf,
virXMLFormatElement(&childBuf, "driver", &driverAttrBuf, NULL);
+ if (iommu->iommufd) { + virBufferAddLit(&childBuf, "<iommufd>\n"); + virBufferAdjustIndent(&childBuf, 2); + virBufferAsprintf(&childBuf, "<id>%s</id>\n", iommu->iommufd->id); + if (iommu->iommufd->fd) + virBufferAsprintf(&childBuf, "<fd>%s</fd>\n", iommu->iommufd->fd); I'm not convinced we should be exposing the QEMU "id" in the XML. It is an identifier libvirt uses internally for configuring QEMU, but I don't see a reason why the end user needs to control its name. I'm similarly not sure we should expose an FD to users eitherm, just use that internally when talking to QEMU.
IOW, why aren't we just having "iommu='yes'" as an attribute on the <iommu> device, avoiding all the logic below that tries to force all devices to use the same configuration.
That makes sense - I will simply have an attribute like "iommufd='yes'" as an attribute in the next revision instead of having users specify the iommufd id themselves. As for exposing an FD to users, this is meant to mirror the qemu implementation where users can specify the FD string [0], so I think we should keep this part as-is. [0] https://www.qemu.org/docs/master/devel/vfio-iommufd.html Thanks, Nathan

On Tue, May 27, 2025 at 06:51:50PM -0700, Nathan Chen wrote:
On 5/20/2025 6:01 AM, Daniel P. Berrangé wrote:
Add support for an "iommufd" virDomainIOMMUDef member that will be translated to a qemu "-object iommufd" command line argument. This iommufd struct has an "id" member to specify the iommufd object id that can be associated with a passthrough device, and an "fd" member to specify the fd for externally opening the /dev/iommu and VFIO cdev by a management layer.
@@ -28310,6 +28406,16 @@ virDomainIOMMUDefFormat(virBuffer *buf, virXMLFormatElement(&childBuf, "driver", &driverAttrBuf, NULL); + if (iommu->iommufd) { + virBufferAddLit(&childBuf, "<iommufd>\n"); + virBufferAdjustIndent(&childBuf, 2); + virBufferAsprintf(&childBuf, "<id>%s</id>\n", iommu->iommufd->id); + if (iommu->iommufd->fd) + virBufferAsprintf(&childBuf, "<fd>%s</fd>\n", iommu->iommufd->fd); I'm not convinced we should be exposing the QEMU "id" in the XML. It is an identifier libvirt uses internally for configuring QEMU, but I don't see a reason why the end user needs to control its name. I'm similarly not sure we should expose an FD to users eitherm, just use that internally when talking to QEMU.
IOW, why aren't we just having "iommu='yes'" as an attribute on the <iommu> device, avoiding all the logic below that tries to force all devices to use the same configuration.
That makes sense - I will simply have an attribute like "iommufd='yes'" as an attribute in the next revision instead of having users specify the iommufd id themselves.
As for exposing an FD to users, this is meant to mirror the qemu implementation where users can specify the FD string [0], so I think we should keep this part as-is.
The FD passing functionality exposed by QEMU is targetted at mgmt apps like libvirt, which apply sandboxing to QEMU, and thus need to pass in pre-opened FDs. Most of the time, this is not something that libvirt will expose to its own API/XML users, as FDs are a very low level concept that is backend implementation specific. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 5/30/2025 6:47 AM, Daniel P. Berrangé wrote:
On 5/20/2025 6:01 AM, Daniel P. Berrangé wrote:
Add support for an "iommufd" virDomainIOMMUDef member that will be translated to a qemu "-object iommufd" command line argument. This iommufd struct has an "id" member to specify the iommufd object id that can be associated with a passthrough device, and an "fd" member to specify the fd for externally opening the /dev/iommu and VFIO cdev by a management layer.
@@ -28310,6 +28406,16 @@ virDomainIOMMUDefFormat(virBuffer *buf, virXMLFormatElement(&childBuf, "driver", &driverAttrBuf, NULL); + if (iommu->iommufd) { + virBufferAddLit(&childBuf, "<iommufd>\n"); + virBufferAdjustIndent(&childBuf, 2); + virBufferAsprintf(&childBuf, "<id>%s</id>\n", iommu->iommufd->id); + if (iommu->iommufd->fd) + virBufferAsprintf(&childBuf, "<fd>%s</fd>\n", iommu->iommufd->fd); I'm not convinced we should be exposing the QEMU "id" in the XML. It is an identifier libvirt uses internally for configuring QEMU, but I don't see a reason why the end user needs to control its name. I'm similarly not sure we should expose an FD to users eitherm, just use that internally when talking to QEMU.
IOW, why aren't we just having "iommu='yes'" as an attribute on the <iommu> device, avoiding all the logic below that tries to force all devices to use the same configuration. That makes sense - I will simply have an attribute like "iommufd='yes'" as an attribute in the next revision instead of having users specify the iommufd id themselves.
As for exposing an FD to users, this is meant to mirror the qemu implementation where users can specify the FD string [0], so I think we should keep this part as-is.
The FD passing functionality exposed by QEMU is targetted at mgmt apps like libvirt, which apply sandboxing to QEMU, and thus need to pass in pre-opened FDs. Most of the time, this is not something that libvirt will expose to its own API/XML users, as FDs are a very low level concept that is backend implementation specific.
I see, I will look into implementing logic to pre-open iommufd FDs from libvirt, and keep this un-exposed to the API/XML users. Pre-opening these FDs may come into play in resolving the permissions issues I was encountering in the cover letter as well. Thanks, Nathan

Implement "iommufdId" and "iommufdFd" attributes for "hostdev" devices that can be used to specify associated iommufd object and fd for externally opening the /dev/iommu and VFIO cdev, when launching a qemu VM. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.c | 31 +++++++++++++++++++++++++++++++ src/conf/domain_conf.h | 2 ++ src/conf/domain_validate.c | 14 ++++++++++++++ src/conf/schemas/domaincommon.rng | 17 +++++++++++++++++ src/qemu/qemu_command.c | 2 ++ 5 files changed, 66 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 9330c47b7a..47b7e9acb1 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -13582,6 +13582,19 @@ virDomainVideoDefParseXML(virDomainXMLOption *xmlopt, return g_steal_pointer(&def); } +static void +virDomainHostdevDefIommufdParseXML(xmlXPathContextPtr ctxt, + char** iommufdId, + char** iommufdFd) +{ + g_autofree char *iommufdIdtmp = virXPathString("string(./iommufdId)", ctxt); + g_autofree char *iommufdFdtmp = virXPathString("string(./iommufdFd)", ctxt); + if (iommufdIdtmp) + *iommufdId = g_steal_pointer(&iommufdIdtmp); + if (iommufdFdtmp) + *iommufdFd = g_steal_pointer(&iommufdFdtmp); +} + static virDomainHostdevDef * virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt, xmlNodePtr node, @@ -13656,6 +13669,8 @@ virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt, if (virDomainNetTeamingInfoParseXML(ctxt, &def->teaming) < 0) goto error; + virDomainHostdevDefIommufdParseXML(ctxt, &def->iommufdId, &def->iommufdFd); + return def; error: @@ -21193,6 +21208,16 @@ virDomainHostdevDefCheckABIStability(virDomainHostdevDef *src, } } + if (src->iommufdId && dst->iommufdId) { + if (STRNEQ(src->iommufdId, dst->iommufdId)) + return false; + } + + if (src->iommufdFd && dst->iommufdFd) { + if (STRNEQ(src->iommufdFd, dst->iommufdFd)) + return false; + } + if (!virDomainDeviceInfoCheckABIStability(src->info, dst->info)) return false; @@ -27511,6 +27536,12 @@ virDomainHostdevDefFormat(virBuffer *buf, if (def->shareable) virBufferAddLit(buf, "<shareable/>\n"); + if (def->iommufdId) { + virBufferAsprintf(buf, "<iommufdId>%s</iommufdId>\n", def->iommufdId); + if (def->iommufdFd) + virBufferAsprintf(buf, "<iommufdFd>%s</iommufdFd>\n", def->iommufdFd); + } + virDomainDeviceInfoFormat(buf, def->info, flags | VIR_DOMAIN_DEF_FORMAT_ALLOW_BOOT | VIR_DOMAIN_DEF_FORMAT_ALLOW_ROM); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index ce447e9823..78507d78b7 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -375,6 +375,8 @@ struct _virDomainHostdevDef { virDomainHostdevCaps caps; } source; virDomainNetTeamingInfo *teaming; + char *iommufdId; + char *iommufdFd; virDomainDeviceInfo *info; /* Guest address */ }; diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index 88649ba0b9..292398b115 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -1401,6 +1401,20 @@ virDomainDefHostdevValidate(const virDomainDef *def) ramfbEnabled = true; } } + + /* Check for unsupported conditions: + * - A hostdev's iommufd fd number is defined but its iommufd id + * number is not + * - An iommu device is defined, iommufd is not defined, + * but a hostdev's iommufd id or fd are defined */ + if ((!dev->iommufdId && dev->iommufdFd) || + (def->iommu && def->niommus > 0 && + !def->iommu[0]->iommufd && + (dev->iommufdId || dev->iommufdFd))) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Unsupported hostdev iommufd configuration")); + return -1; + } } return 0; diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index 41db338c71..6c1f901eb4 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6442,6 +6442,12 @@ <optional> <ref name="address"/> </optional> + <optional> + <ref name="iommufdId"/> + </optional> + <optional> + <ref name="iommufdFd"/> + </optional> <optional> <element name="readonly"> <empty/> @@ -7809,6 +7815,17 @@ </element> </define> + <define name="iommufdId"> + <element name="iommufdId"> + <text/> + </element> + </define> + <define name="iommufdFd"> + <element name="iommufdFd"> + <text/> + </element> + </define> + <define name="deviceBoot"> <element name="boot"> <attribute name="order"> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index d683d0eef7..3a8b71a00a 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4724,6 +4724,8 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, "S:failover_pair_id", failover_pair_id, "S:display", qemuOnOffAuto(pcisrc->display), "B:ramfb", ramfb, + "S:iommufd", dev->iommufdId, + "S:fd", dev->iommufdFd, NULL) < 0) return NULL; -- 2.43.0

On Thu, May 15, 2025 at 01:36:41PM -0700, Nathan Chen via Devel wrote:
Implement "iommufdId" and "iommufdFd" attributes for "hostdev" devices that can be used to specify associated iommufd object and fd for externally opening the /dev/iommu and VFIO cdev, when launching a qemu VM.
What does it mean if we enable iommufd on the 'hostdev' without enabling iommufd on the 'iommu' device, or vica-verca ? IMHO it seems like we should be enabling iommufd once only.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.c | 31 +++++++++++++++++++++++++++++++ src/conf/domain_conf.h | 2 ++ src/conf/domain_validate.c | 14 ++++++++++++++ src/conf/schemas/domaincommon.rng | 17 +++++++++++++++++ src/qemu/qemu_command.c | 2 ++ 5 files changed, 66 insertions(+)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 9330c47b7a..47b7e9acb1 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -13582,6 +13582,19 @@ virDomainVideoDefParseXML(virDomainXMLOption *xmlopt, return g_steal_pointer(&def); }
+static void +virDomainHostdevDefIommufdParseXML(xmlXPathContextPtr ctxt, + char** iommufdId, + char** iommufdFd) +{ + g_autofree char *iommufdIdtmp = virXPathString("string(./iommufdId)", ctxt); + g_autofree char *iommufdFdtmp = virXPathString("string(./iommufdFd)", ctxt); + if (iommufdIdtmp) + *iommufdId = g_steal_pointer(&iommufdIdtmp); + if (iommufdFdtmp) + *iommufdFd = g_steal_pointer(&iommufdFdtmp); +} + static virDomainHostdevDef * virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt, xmlNodePtr node, @@ -13656,6 +13669,8 @@ virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt, if (virDomainNetTeamingInfoParseXML(ctxt, &def->teaming) < 0) goto error;
+ virDomainHostdevDefIommufdParseXML(ctxt, &def->iommufdId, &def->iommufdFd); + return def;
error: @@ -21193,6 +21208,16 @@ virDomainHostdevDefCheckABIStability(virDomainHostdevDef *src, } }
+ if (src->iommufdId && dst->iommufdId) { + if (STRNEQ(src->iommufdId, dst->iommufdId)) + return false; + } + + if (src->iommufdFd && dst->iommufdFd) { + if (STRNEQ(src->iommufdFd, dst->iommufdFd)) + return false; + } + if (!virDomainDeviceInfoCheckABIStability(src->info, dst->info)) return false;
@@ -27511,6 +27536,12 @@ virDomainHostdevDefFormat(virBuffer *buf, if (def->shareable) virBufferAddLit(buf, "<shareable/>\n");
+ if (def->iommufdId) { + virBufferAsprintf(buf, "<iommufdId>%s</iommufdId>\n", def->iommufdId); + if (def->iommufdFd) + virBufferAsprintf(buf, "<iommufdFd>%s</iommufdFd>\n", def->iommufdFd); + } + virDomainDeviceInfoFormat(buf, def->info, flags | VIR_DOMAIN_DEF_FORMAT_ALLOW_BOOT | VIR_DOMAIN_DEF_FORMAT_ALLOW_ROM);
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index ce447e9823..78507d78b7 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -375,6 +375,8 @@ struct _virDomainHostdevDef { virDomainHostdevCaps caps; } source; virDomainNetTeamingInfo *teaming; + char *iommufdId; + char *iommufdFd; virDomainDeviceInfo *info; /* Guest address */ };
diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index 88649ba0b9..292398b115 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -1401,6 +1401,20 @@ virDomainDefHostdevValidate(const virDomainDef *def) ramfbEnabled = true; } } + + /* Check for unsupported conditions: + * - A hostdev's iommufd fd number is defined but its iommufd id + * number is not + * - An iommu device is defined, iommufd is not defined, + * but a hostdev's iommufd id or fd are defined */ + if ((!dev->iommufdId && dev->iommufdFd) || + (def->iommu && def->niommus > 0 && + !def->iommu[0]->iommufd && + (dev->iommufdId || dev->iommufdFd))) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("Unsupported hostdev iommufd configuration")); + return -1; + } }
return 0; diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index 41db338c71..6c1f901eb4 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6442,6 +6442,12 @@ <optional> <ref name="address"/> </optional> + <optional> + <ref name="iommufdId"/> + </optional> + <optional> + <ref name="iommufdFd"/> + </optional> <optional> <element name="readonly"> <empty/> @@ -7809,6 +7815,17 @@ </element> </define>
+ <define name="iommufdId"> + <element name="iommufdId"> + <text/> + </element> + </define> + <define name="iommufdFd"> + <element name="iommufdFd"> + <text/> + </element> + </define> + <define name="deviceBoot"> <element name="boot"> <attribute name="order"> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index d683d0eef7..3a8b71a00a 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4724,6 +4724,8 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, "S:failover_pair_id", failover_pair_id, "S:display", qemuOnOffAuto(pcisrc->display), "B:ramfb", ramfb, + "S:iommufd", dev->iommufdId, + "S:fd", dev->iommufdFd, NULL) < 0) return NULL;
-- 2.43.0
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 5/20/2025 6:05 AM, Daniel P. Berrangé wrote:
Implement "iommufdId" and "iommufdFd" attributes for "hostdev" devices that can be used to specify associated iommufd object and fd for externally opening the /dev/iommu and VFIO cdev, when launching a qemu VM. What does it mean if we enable iommufd on the 'hostdev' without enabling iommufd on the 'iommu' device, or vica-verca ?
There are checks in this series that prevent enabling iommufd on the hostdev without enabling iommufd on the iommu device as it would not make sense to link a hostdev with a non-existent iommufd object. If enabling iommufd on the iommu device and not enabling iommufd on a hostdev, qemu will fall back to legacy VFIO behavior for the hostdev and the iommufd object will not be used for passthrough device mappings.
IMHO it seems like we should be enabling iommufd once only.
I can see why we should be enabling iommufd once only for the initial iommufd object, but we still need to provide an option to select legacy VFIO or iommufd for each hostdev definition. And if we would like to enable iommufd once only for the initial iommufd object, would it make more sense to move the iommufd attribute outside of the <iommu> stanza to account for the case where we have multiple <iommu> definitions? Thanks, Nathan

Allow access to /dev/iommu and /dev/vfio/devices/vfio* when launching a qemu VM with iommufd feature enabled. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_cgroup.h | 1 + src/qemu/qemu_namespace.c | 36 ++++++++++++++++++++++++++++++ 3 files changed, 84 insertions(+) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 48af467bf9..df4e73ad15 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -462,6 +462,47 @@ qemuTeardownInputCgroup(virDomainObj *vm, } +int +qemuSetupIommufdCgroup(virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + g_autofree char *path = NULL; + + if (vm->def->iommu && vm->def->niommus > 0 && + /* Check if iommufd is enabled */ + vm->def->iommu[0]->iommufd) { + if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) + return 0; + if (virDirOpen(&dir, "/dev/vfio/devices") < 0) { + if (errno == ENOENT) + return 0; + return -1; + } + while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) { + if (STRPREFIX(dent->d_name, "vfio")) { + path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + } + if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } + path = NULL; + } + if (virFileExists("/dev/iommu")) + path = g_strdup("/dev/iommu"); + if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } + } + return 0; +} + + /** * qemuSetupHostdevCgroup: * vm: domain object @@ -830,6 +871,12 @@ qemuSetupDevicesCgroup(virDomainObj *vm) return -1; } + if (vm->def->iommu && vm->def->niommus > 0 && + vm->def->iommu[0]->iommufd) { + if (qemuSetupIommufdCgroup(vm) < 0) + return -1; + } + for (i = 0; i < vm->def->nmems; i++) { if (qemuSetupMemoryDevicesCgroup(vm, vm->def->mems[i]) < 0) return -1; diff --git a/src/qemu/qemu_cgroup.h b/src/qemu/qemu_cgroup.h index 3668034cde..bea677ba3c 100644 --- a/src/qemu/qemu_cgroup.h +++ b/src/qemu/qemu_cgroup.h @@ -42,6 +42,7 @@ int qemuSetupHostdevCgroup(virDomainObj *vm, int qemuTeardownHostdevCgroup(virDomainObj *vm, virDomainHostdevDef *dev) G_GNUC_WARN_UNUSED_RESULT; +int qemuSetupIommufdCgroup(virDomainObj *vm); int qemuSetupMemoryDevicesCgroup(virDomainObj *vm, virDomainMemoryDef *mem); int qemuTeardownMemoryDevicesCgroup(virDomainObj *vm, diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 59421ec9d1..150c6ceae3 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -676,6 +676,39 @@ qemuDomainSetupLaunchSecurity(virDomainObj *vm, } +static int +qemuDomainSetupIommufd(virDomainObj *vm, + GSList **paths) +{ + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + g_autofree char *path = NULL; + + /* Check if iommufd is enabled */ + if (vm->def->iommu && vm->def->niommus > 0 && + vm->def->iommu[0]->iommufd) { + if (virDirOpen(&dir, "/dev/vfio/devices") < 0) { + if (errno == ENOENT) + return 0; + return -1; + } + while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) { + if (STRPREFIX(dent->d_name, "vfio")) { + path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); + } + } + path = NULL; + if (virFileExists("/dev/iommu")) + path = g_strdup("/dev/iommu"); + if (path) + *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); + } + + return 0; +} + + static int qemuNamespaceMknodPaths(virDomainObj *vm, GSList *paths, @@ -699,6 +732,9 @@ qemuDomainBuildNamespace(virQEMUDriverConfig *cfg, if (qemuDomainSetupAllDisks(vm, &paths) < 0) return -1; + if (qemuDomainSetupIommufd(vm, &paths) < 0) + return -1; + if (qemuDomainSetupAllHostdevs(vm, &paths) < 0) return -1; -- 2.43.0

Implement a sub-test in qemuxmlconftest that takes an iommufd attribute from the iommu device stanza, and iommufdId and iommufdFd hostdev attributes in a VM definition. Translate the VM definition to a qemu command line that associates the hostdev withan iommufd object, as well as a fd for externally opening /dev/iommu and VFIO cdev from an external management layer. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- ...fio-iommufd-intel-iommu.x86_64-latest.args | 43 ++++++++++ ...vfio-iommufd-intel-iommu.x86_64-latest.xml | 80 +++++++++++++++++++ .../hostdev-vfio-iommufd-intel-iommu.xml | 80 +++++++++++++++++++ tests/qemuxmlconftest.c | 1 + 4 files changed, 204 insertions(+) create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.xml diff --git a/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.args b/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.args new file mode 100644 index 0000000000..aaf54fc93f --- /dev/null +++ b/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.args @@ -0,0 +1,43 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest2 \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest2/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest2/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest2/.config \ +/usr/bin/qemu-system-x86_64 \ +-name guest=QEMUGuest2,debug-threads=on \ +-S \ +-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-QEMUGuest2/master-key.aes"}' \ +-machine q35,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=off \ +-accel tcg \ +-cpu qemu64 \ +-m size=219136k \ +-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":224395264}' \ +-overcommit mem-lock=off \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid c7a5fdbd-edaf-9466-926a-d65c16db1809 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-device '{"driver":"pcie-root-port","port":8,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x1"}' \ +-device '{"driver":"pcie-root-port","port":9,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x1.0x1"}' \ +-device '{"driver":"qemu-xhci","id":"usb","bus":"pci.1","addr":"0x0"}' \ +-object '{"qom-type":"iommufd","id":"iommufd0","fd":"22"}' \ +-device '{"driver":"intel-iommu","id":"iommu0"}' \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-global ICH9-LPC.noreboot=off \ +-watchdog-action reset \ +-device '{"driver":"vfio-pci","host":"0000:06:12.1","id":"hostdev0","bus":"pcie.0","addr":"0x2"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.2","id":"hostdev1","iommufd":"iommufd0","fd":"23","bus":"pcie.0","addr":"0x3"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.3","id":"hostdev2","bus":"pcie.0","addr":"0x4"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.4","id":"hostdev3","bus":"pcie.0","addr":"0x5"}' \ +-device '{"driver":"virtio-balloon-pci","id":"balloon0","bus":"pcie.0","addr":"0x6"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.xml b/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.xml new file mode 100644 index 0000000000..4f9cfa2f1d --- /dev/null +++ b/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.x86_64-latest.xml @@ -0,0 +1,80 @@ +<domain type='qemu'> + <name>QEMUGuest2</name> + <uuid>c7a5fdbd-edaf-9466-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='x86_64' machine='q35'>hvm</type> + <boot dev='hd'/> + </os> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>qemu64</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pcie-root'/> + <controller type='pci' index='1' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='1' port='0x8'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/> + </controller> + <controller type='pci' index='2' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='2' port='0x9'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <controller type='usb' index='0' model='qemu-xhci'> + <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </controller> + <controller type='sata' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <audio id='1' type='none'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x1'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> + </source> + <iommufdId>iommufd0</iommufdId> + <iommufdFd>23</iommufdFd> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver model='vfio-pci-igb'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x3'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio' model='vfio-pci-igb'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x4'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> + </hostdev> + <watchdog model='itco' action='reset'/> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> + </memballoon> + <iommu model='intel'> + <iommufd> + <id>iommufd0</id> + <fd>22</fd> + </iommufd> + </iommu> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.xml b/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.xml new file mode 100644 index 0000000000..4f9cfa2f1d --- /dev/null +++ b/tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel-iommu.xml @@ -0,0 +1,80 @@ +<domain type='qemu'> + <name>QEMUGuest2</name> + <uuid>c7a5fdbd-edaf-9466-926a-d65c16db1809</uuid> + <memory unit='KiB'>219100</memory> + <currentMemory unit='KiB'>219100</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='x86_64' machine='q35'>hvm</type> + <boot dev='hd'/> + </os> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>qemu64</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pcie-root'/> + <controller type='pci' index='1' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='1' port='0x8'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/> + </controller> + <controller type='pci' index='2' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='2' port='0x9'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/> + </controller> + <controller type='usb' index='0' model='qemu-xhci'> + <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </controller> + <controller type='sata' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <audio id='1' type='none'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x1'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> + </source> + <iommufdId>iommufd0</iommufdId> + <iommufdFd>23</iommufdFd> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver model='vfio-pci-igb'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x3'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </hostdev> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver name='vfio' model='vfio-pci-igb'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x4'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> + </hostdev> + <watchdog model='itco' action='reset'/> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> + </memballoon> + <iommu model='intel'> + <iommufd> + <id>iommufd0</id> + <fd>22</fd> + </iommufd> + </iommu> + </devices> +</domain> diff --git a/tests/qemuxmlconftest.c b/tests/qemuxmlconftest.c index 1f31ec810c..bf62271d9d 100644 --- a/tests/qemuxmlconftest.c +++ b/tests/qemuxmlconftest.c @@ -2140,6 +2140,7 @@ mymain(void) DO_TEST_CAPS_LATEST_PARSE_ERROR("hostdev-mdev-display-ramfb-multiple"); DO_TEST_CAPS_LATEST_PARSE_ERROR("hostdev-vfio-zpci-wrong-arch"); DO_TEST_CAPS_ARCH_LATEST("hostdev-vfio-zpci", "s390x"); + DO_TEST_CAPS_LATEST("hostdev-vfio-iommufd-intel-iommu"); DO_TEST_CAPS_ARCH_LATEST_PARSE_ERROR("hostdev-vfio-zpci-invalid-uid-valid-fid", "s390x"); DO_TEST_CAPS_ARCH_LATEST("hostdev-vfio-zpci-multidomain-many", "s390x"); DO_TEST_CAPS_ARCH_LATEST("hostdev-vfio-zpci-autogenerate", "s390x"); -- 2.43.0

-----Original Message----- From: Nathan Chen <nathanc@nvidia.com> Sent: Thursday, May 15, 2025 9:37 PM To: devel@lists.libvirt.org Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; nicolinc@nvidia.com; Nathan Chen <nathanc@nvidia.com> Subject: [RFC PATCH 0/5] qemu: Implement support for iommufd and multiple vSMMUs
Hi,
This is a follow up to the first RFC patchset [0] for supporting multiple vSMMU instances in a qemu VM. This patchset also introduces support for using iommufd to propagate DMA mappings to kernel for assigned devices.
This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]
Based on feedback released on the above RFC and the discussion here[1], there are certain changes to the name of the vSMMU device and the way we associate the PCIe bus. Going forward it is more likely to be something like below, -device arm-smmuv3,primary-bus=pcie.0,accel=on -device vfio-pci,host=xxx,,bus=pcie.0 -device pxb-pcie,id=pcie.1,bus_nr=2 -device arm-smmuv3,primary-bus=pcie.1,accel=on ... Hopefully, this doesn't warrant any major changes to this libvirt series, but please do make a note of it. Thanks, Shameer [0] https://lore.kernel.org/qemu-devel/aB25ZRu7pCJNpamt@redhat.com/
Moreover, it adds a new 'iommufd' member for virDomainIOMMUDef, in order to represent the iommufd object in qemu command line. This patchset also implements new 'iommufdId' and 'iommufdFd' attributes for hostdev devices to be associated with the iommufd object.
For instance, specifying the iommufd object and associated hostdev in a VM definition with multiple IOMMUs, configured to be routed to pcie-expander-bus controllers in a way where VFIO device to SMMUv3 associations are matched with the host (pcie-expander-bus and pcie-root-port controllers are no longer auto-added/auto-routed like in the first revision of this RFC, as the PCIe topology will be configured by management apps):
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='252'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='248'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> ... <controller type='pci' index='21' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='21' port='0x0'/> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </controller> <controller type='pci' index='22' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='22' port='0xa8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </controller> ... <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> </iommu> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-device '{"driver":"pxb- pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \ -device '{"driver":"pxb- pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \ -device '{"driver":"pcie-root- port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \ -device '{"driver":"pcie-root- port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \ -object '{"qom-type":"iommufd","id":"iommufd0"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.1"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.2"}' \ -device '{"driver":"vfio- pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","bus":"pci .21","addr":"0x0"}' \ -device '{"driver":"vfio- pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","bus":"pci .22","addr":"0x0"}' \
If users would like to leverage qemu's iommufd feature to open the VFIO cdev and /dev/iommu via an external management layer, the fd can be specified like so in the VM definition:
<devices> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> </source> <iommufdId>iommufd0</iommufdId> <iommufdFd>23</iommufdFd> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </hostdev> <iommu model='intel'> <iommufd> <id>iommufd0</id> <fd>22</fd> </iommufd> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-object '{"qom-type":"iommufd","id":"iommufd0","fd":"22"}' \ -device '{"driver":"vfio- pci","host":"0000:06:12.2","id":"hostdev1","iommufd":"iommufd0","fd":"23", "bus":"pci.0","addr":"0x3"}' \
Summary of changes: - Introduced support for specifying multiple <iommu> stanzas in the VM XML definition when using smmuv3Dev model. - Automating PCIe topology to populate VM definition with multiple vSMMUs routed to pcie-expander-bus controllers is excluded, in favor of deferring creation of PXBs and routing of VFIO devices to management apps. - Introduced iommufd support.
TODO: - I updated the namespace and cgroup configuration to allow access to iommufd paths at /dev/vfio/devices/vfio* and /dev/iommu. However, qemu needs to be launched with user and group set to 'root' in order for these paths to be accessible. A passthrough device represented by /dev/vfio/18 normally has 'root' user and group permissions, but in the mount namespace it's changed to 'libvirt-qemu' and 'kvm'. I wasn't able to discern where this is happening by looking at src/qemu/qemu_namespace.c and src/qemu/qemu_cgroup.c. Would you have any pointers on how to change the iommufd paths' user and group permissions in the libvirt mount namespace?
This series is on Github: https://github.com/NathanChenNVIDIA/libvirt/tree/smmuv3Dev-iommufd-04- 15-25
Thanks, Nathan
[0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/7GDT6... PAJMPP4ZSC4ACME6GVMG236/ [1] https://lore.kernel.org/qemu-devel/20250311141045.66620-1- shameerali.kolothum.thodi@huawei.com/
Signed-off-by: Nathan Chen <nathanc@nvidia.com>
Nathan Chen (5): conf: Support multiple smmuv3Dev IOMMU devices conf: Add an iommufd member struct to virDomainIOMMUDef qemu: Implement support for associating iommufd to hostdev qemu: Update Cgroup and namespace for qemu to access iommufd paths qemu: Add test case for specifying iommufd
docs/formatdomain.rst | 5 +- src/conf/domain_addr.c | 12 +- src/conf/domain_addr.h | 4 +- src/conf/domain_conf.c | 292 ++++++++++++++++-- src/conf/domain_conf.h | 21 +- src/conf/domain_validate.c | 94 +++++- src/conf/schemas/domaincommon.rng | 37 ++- src/conf/virconftypes.h | 2 + src/libvirt_private.syms | 2 + src/qemu/qemu_alias.c | 15 +- src/qemu/qemu_cgroup.c | 47 +++ src/qemu/qemu_cgroup.h | 1 + src/qemu/qemu_command.c | 146 ++++++--- src/qemu/qemu_domain_address.c | 33 +- src/qemu/qemu_driver.c | 8 +- src/qemu/qemu_namespace.c | 36 +++ src/qemu/qemu_postparse.c | 11 +- src/qemu/qemu_validate.c | 22 +- ...fio-iommufd-intel-iommu.x86_64-latest.args | 43 +++ ...vfio-iommufd-intel-iommu.x86_64-latest.xml | 80 +++++ .../hostdev-vfio-iommufd-intel-iommu.xml | 80 +++++ tests/qemuxmlconftest.c | 1 + 22 files changed, 878 insertions(+), 114 deletions(-) create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel- iommu.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel- iommu.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/hostdev-vfio-iommufd-intel- iommu.xml
-- 2.43.0

On 5/16/2025 3:19 AM, Shameerali Kolothum Thodi wrote:
Hi,
This is a follow up to the first RFC patchset [0] for supporting multiple vSMMU instances in a qemu VM. This patchset also introduces support for using iommufd to propagate DMA mappings to kernel for assigned devices.
This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1] Based on feedback released on the above RFC and the discussion here[1], there are certain changes to the name of the vSMMU device and the way we associate the PCIe bus.
Going forward it is more likely to be something like below,
-device arm-smmuv3,primary-bus=pcie.0,accel=on -device vfio-pci,host=xxx,,bus=pcie.0 -device pxb-pcie,id=pcie.1,bus_nr=2 -device arm-smmuv3,primary-bus=pcie.1,accel=on ...
Hopefully, this doesn't warrant any major changes to this libvirt series, but please do make a note of it.
Thanks, Shameer
Thanks Shameer, I will make a note of this for the next revision. Best, Nathan

On Thu, May 15, 2025 at 01:36:38PM -0700, Nathan Chen via Devel wrote:
Hi,
This is a follow up to the first RFC patchset [0] for supporting multiple vSMMU instances in a qemu VM. This patchset also introduces support for using iommufd to propagate DMA mappings to kernel for assigned devices.
This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]
Moreover, it adds a new 'iommufd' member for virDomainIOMMUDef, in order to represent the iommufd object in qemu command line. This patchset also implements new 'iommufdId' and 'iommufdFd' attributes for hostdev devices to be associated with the iommufd object.
For instance, specifying the iommufd object and associated hostdev in a VM definition with multiple IOMMUs, configured to be routed to pcie-expander-bus controllers in a way where VFIO device to SMMUv3 associations are matched with the host (pcie-expander-bus and pcie-root-port controllers are no longer auto-added/auto-routed like in the first revision of this RFC, as the PCIe topology will be configured by management apps):
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='252'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='248'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> ... <controller type='pci' index='21' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='21' port='0x0'/> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </controller> <controller type='pci' index='22' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='22' port='0xa8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </controller> ... <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/>
IIUC, you're using <address> here to reference the earlier <controller> pcie-expander-bus. This is a bit wierd as it is making it look like the smmuv3Dev itself has a PCI address, but this is just the PCI address of the controller. The smmuv3dev also doesn't have an address on the pcie-expander-bus, it is just an association IIUC. So from this pov, I think I'd be inclined to say we should just reference the <controller> based on its index, using an attribute <iommu model='smmuv3dev' controller='2'/>
</iommu> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \ -device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \ -device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \ -device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \ -object '{"qom-type":"iommufd","id":"iommufd0"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.1"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.2"}' \ -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","bus":"pci.21","addr":"0x0"}' \ -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","bus":"pci.22","addr":"0x0"}' \
The iommufd integration in the XML looks a bit wierd too - we have four different elements all referencing 'iommufd0' but nothing is defining this. The iommu references the iommufd0, but nothing actually uses this on the arm-smuv3-accel command line. I've not been paying much attention to iommufd in QEMU, but IIUC it will apply to x86_64 too. So I'm wondering how iommufd integration sound work in libvirt more broadly.
If users would like to leverage qemu's iommufd feature to open the VFIO cdev and /dev/iommu via an external management layer, the fd can be specified like so in the VM definition:
<devices> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> </source> <iommufdId>iommufd0</iommufdId> <iommufdFd>23</iommufdFd> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </hostdev> <iommu model='intel'> <iommufd> <id>iommufd0</id> <fd>22</fd> </iommufd> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-object '{"qom-type":"iommufd","id":"iommufd0","fd":"22"}' \ -device '{"driver":"vfio-pci","host":"0000:06:12.2","id":"hostdev1","iommufd":"iommufd0","fd":"23","bus":"pci.0","addr":"0x3"}' \
I'm not getting why we have multiple different FDs here, when we only have a single iommufd for the VMs ?
Summary of changes: - Introduced support for specifying multiple <iommu> stanzas in the VM XML definition when using smmuv3Dev model. - Automating PCIe topology to populate VM definition with multiple vSMMUs routed to pcie-expander-bus controllers is excluded, in favor of deferring creation of PXBs and routing of VFIO devices to management apps. - Introduced iommufd support.
TODO: - I updated the namespace and cgroup configuration to allow access to iommufd paths at /dev/vfio/devices/vfio* and /dev/iommu. However, qemu needs to be launched with user and group set to 'root' in order for these paths to be accessible. A passthrough device represented by /dev/vfio/18 normally has 'root' user and group permissions, but in the mount namespace it's changed to 'libvirt-qemu' and 'kvm'. I wasn't able to discern where this is happening by looking at src/qemu/qemu_namespace.c and src/qemu/qemu_cgroup.c. Would you have any pointers on how to change the iommufd paths' user and group permissions in the libvirt mount namespace?
All permissions are handled by the security managers in src/security, both DAC file permissions/ownership and SELinux labelling. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Hi Daniel, On 5/20/2025 5:51 AM, Daniel P. Berrangé wrote:
Hi,
This is a follow up to the first RFC patchset [0] for supporting multiple vSMMU instances in a qemu VM. This patchset also introduces support for using iommufd to propagate DMA mappings to kernel for assigned devices.
This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]
Moreover, it adds a new 'iommufd' member for virDomainIOMMUDef, in order to represent the iommufd object in qemu command line. This patchset also implements new 'iommufdId' and 'iommufdFd' attributes for hostdev devices to be associated with the iommufd object.
For instance, specifying the iommufd object and associated hostdev in a VM definition with multiple IOMMUs, configured to be routed to pcie-expander-bus controllers in a way where VFIO device to SMMUv3 associations are matched with the host (pcie-expander-bus and pcie-root-port controllers are no longer auto-added/auto-routed like in the first revision of this RFC, as the PCIe topology will be configured by management apps):
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='252'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='248'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> ... <controller type='pci' index='21' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='21' port='0x0'/> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </controller> <controller type='pci' index='22' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='22' port='0xa8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </controller> ... <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> IIUC, you're using <address> here to reference the earlier <controller> pcie-expander-bus. This is a bit wierd as it is making it look like the smmuv3Dev itself has a PCI address, but this is just the PCI address of the controller.
The smmuv3dev also doesn't have an address on the pcie-expander-bus, it is just an association IIUC.
So from this pov, I think I'd be inclined to say we should just reference the <controller> based on its index, using an attribute
<iommu model='smmuv3dev' controller='2'/>
I see, I will revise this to reference the controller index instead.
</iommu> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \ -device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \ -device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \ -device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \ -object '{"qom-type":"iommufd","id":"iommufd0"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.1"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.2"}' \ -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","bus":"pci.21","addr":"0x0"}' \ -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","bus":"pci.22","addr":"0x0"}' \
The iommufd integration in the XML looks a bit wierd too - we have four different elements all referencing 'iommufd0' but nothing is defining this. The iommu references the iommufd0, but nothing actually uses this on the arm-smuv3-accel command line.
I've not been paying much attention to iommufd in QEMU, but IIUC it will apply to x86_64 too. So I'm wondering how iommufd integration sound work in libvirt more broadly.
It is my understanding that we want to consider device classes for libvirt device representation in XML, so I intended to have users declare the iommufd definition as an attribute under the <iommu> stanza, which would translate to the following qemu argument: -object '{"qom-type":"iommufd","id":"iommufd0"}' but since this series implements support for multiple <iommu> definitions, we specify iommufd0 for multiple <iommu> stanzas. For x86_64, we would just specify the iommufd attribute once under a single <iommu> stanza. Would you suggest we move iommufd out of the <iommu> definition instead, like the examples below? <domain type='kvm'> ... <devices> <iommufd>iommufd0</iommufd> <iommu model='smmuv3Dev' controller='22'/> </devices> ... </domain> *or* <domain type='kvm'> ... <iommufd>iommufd0</iommufd> <devices> <iommu model='smmuv3Dev' controller='22'/> </devices> ... </domain>
If users would like to leverage qemu's iommufd feature to open the VFIO cdev and /dev/iommu via an external management layer, the fd can be specified like so in the VM definition:
<devices> <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio'/> <source> <address domain='0x0000' bus='0x06' slot='0x12' function='0x2'/> </source> <iommufdId>iommufd0</iommufdId> <iommufdFd>23</iommufdFd> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </hostdev> <iommu model='intel'> <iommufd> <id>iommufd0</id> <fd>22</fd> </iommufd> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-object '{"qom-type":"iommufd","id":"iommufd0","fd":"22"}' \ -device '{"driver":"vfio-pci","host":"0000:06:12.2","id":"hostdev1","iommufd":"iommufd0","fd":"23","bus":"pci.0","addr":"0x3"}' \ I'm not getting why we have multiple different FDs here, when we only have a single iommufd for the VMs ?
Summary of changes: - Introduced support for specifying multiple <iommu> stanzas in the VM XML definition when using smmuv3Dev model. - Automating PCIe topology to populate VM definition with multiple vSMMUs routed to pcie-expander-bus controllers is excluded, in favor of deferring creation of PXBs and routing of VFIO devices to management apps. - Introduced iommufd support.
TODO: - I updated the namespace and cgroup configuration to allow access to iommufd paths at /dev/vfio/devices/vfio* and /dev/iommu. However, qemu needs to be launched with user and group set to 'root' in order for these paths to be accessible. A passthrough device represented by /dev/vfio/18 normally has 'root' user and group permissions, but in the mount namespace it's changed to 'libvirt-qemu' and 'kvm'. I wasn't able to discern where this is happening by looking at src/qemu/qemu_namespace.c and src/qemu/qemu_cgroup.c. Would you have any pointers on how to change the iommufd paths' user and group permissions in the libvirt mount namespace? All permissions are handled by the security managers in src/security, both DAC file permissions/ownership and SELinux labelling.
Thanks, I will take a look under src/security/ and try to resolve this.
With regards, Daniel
Best, Nathan

Hi Daniel,
On 5/20/2025 5:51 AM, Daniel P. Berrangé wrote:
Hi,
This is a follow up to the first RFC patchset [0] for supporting multiple vSMMU instances in a qemu VM. This patchset also introduces support for using iommufd to propagate DMA mappings to kernel for assigned devices.
This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]
Moreover, it adds a new 'iommufd' member for virDomainIOMMUDef, in order to represent the iommufd object in qemu command line. This patchset also implements new 'iommufdId' and 'iommufdFd' attributes for hostdev devices to be associated with the iommufd object.
For instance, specifying the iommufd object and associated hostdev in a VM definition with multiple IOMMUs, configured to be routed to pcie-expander-bus controllers in a way where VFIO device to SMMUv3 associations are matched with the host (pcie-expander-bus and pcie-root-port controllers are no longer auto-added/auto-routed like in the first revision of this RFC, as the PCIe topology will be configured by management apps):
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='252'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='248'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> ... <controller type='pci' index='21' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='21' port='0x0'/> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </controller> <controller type='pci' index='22' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='22' port='0xa8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </controller> ... <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> IIUC, you're using <address> here to reference the earlier <controller> pcie-expander-bus. This is a bit wierd as it is making it look like the smmuv3Dev itself has a PCI address, but this is just the PCI address of the controller.
The smmuv3dev also doesn't have an address on the pcie-expander-bus, it is just an association IIUC.
So from this pov, I think I'd be inclined to say we should just reference the <controller> based on its index, using an attribute
<iommu model='smmuv3dev' controller='2'/>
I see, I will revise this to reference the controller index instead.
</iommu> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \ -device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \ -device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \ -device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \ -object '{"qom-type":"iommufd","id":"iommufd0"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.1"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.2"}' \ -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","bus":"pci.21","addr":"0x0"}' \ -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","bus":"pci.22","addr":"0x0"}' \
The iommufd integration in the XML looks a bit wierd too - we have four different elements all referencing 'iommufd0' but nothing is defining this. The iommu references the iommufd0, but nothing actually uses this on the arm-smuv3-accel command line.
I've not been paying much attention to iommufd in QEMU, but IIUC it will apply to x86_64 too. So I'm wondering how iommufd integration sound work in libvirt more broadly.
It is my understanding that we want to consider device classes for libvirt device representation in XML, so I intended to have users declare the iommufd definition as an attribute under the <iommu> stanza, which would>
On Tue, May 27, 2025 at 06:43:46PM -0700, Nathan Chen wrote: translate to the following qemu argument:
-object '{"qom-type":"iommufd","id":"iommufd0"}'
but since this series implements support for multiple <iommu> definitions, we specify iommufd0 for multiple <iommu> stanzas. For x86_64, we would just specify the iommufd attribute once under a single <iommu> stanza.
Would you suggest we move iommufd out of the <iommu> definition instead, like the examples below?
AFAICT iommufd isn't connected to the guest iommu at all in terms of configuration, it is simply an attribute of the hostdev. eg we could do <hostdev mode='subsystem' type='mdev' model='vfio-pci' iommufd='on'> that does leave open the possibility that someone configures iommufd on one hostdev, but not on another, but that's not as bad as when we set it on the <iommu> too. So something we can validate in post-parse logic if we need to ensure consistent usage - if qemu allows a mix of iommfd and non-iommufd for vfio-pci, we can just allow that at libvirt too With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 5/30/2025 8:05 AM, Daniel P. Berrangé wrote:
Hi Daniel,
On 5/20/2025 5:51 AM, Daniel P. Berrangé wrote:
Hi,
This is a follow up to the first RFC patchset [0] for supporting multiple vSMMU instances in a qemu VM. This patchset also introduces support for using iommufd to propagate DMA mappings to kernel for assigned devices.
This patchset implements support for specifying multiple <iommu> devices within the VM definition when smmuv3Dev IOMMU model is specified, and is tested with Shameer's latest qemu RFC for HW-accelerated vSMMU devices [1]
Moreover, it adds a new 'iommufd' member for virDomainIOMMUDef, in order to represent the iommufd object in qemu command line. This patchset also implements new 'iommufdId' and 'iommufdFd' attributes for hostdev devices to be associated with the iommufd object.
For instance, specifying the iommufd object and associated hostdev in a VM definition with multiple IOMMUs, configured to be routed to pcie-expander-bus controllers in a way where VFIO device to SMMUv3 associations are matched with the host (pcie-expander-bus and pcie-root-port controllers are no longer auto-added/auto-routed like in the first revision of this RFC, as the PCIe topology will be configured by management apps):
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='252'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='248'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> ... <controller type='pci' index='21' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='21' port='0x0'/> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </controller> <controller type='pci' index='22' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='22' port='0xa8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </controller> ... <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <iommufdId>iommufd0</iommufdId> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x01' slot='0x01' function='0x0'/> IIUC, you're using <address> here to reference the earlier <controller> pcie-expander-bus. This is a bit wierd as it is making it look like the smmuv3Dev itself has a PCI address, but this is just the PCI address of the controller.
The smmuv3dev also doesn't have an address on the pcie-expander-bus, it is just an association IIUC.
So from this pov, I think I'd be inclined to say we should just reference the <controller> based on its index, using an attribute
<iommu model='smmuv3dev' controller='2'/>
I see, I will revise this to reference the controller index instead.
</iommu> <iommu model='smmuv3Dev'> <iommufd> <id>iommufd0</id> </iommufd> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </iommu> </devices>
This would get translated to a qemu command line with the arguments below:
-device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \ -device '{"driver":"pxb-pcie","bus_nr":248,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \ -device '{"driver":"pcie-root-port","port":0,"chassis":21,"id":"pci.21","bus":"pci.1","addr":"0x0"}' \ -device '{"driver":"pcie-root-port","port":168,"chassis":22,"id":"pci.22","bus":"pci.2","addr":"0x0"}' \ -object '{"qom-type":"iommufd","id":"iommufd0"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.1"}' \ -device '{"driver":"arm-smmuv3-accel","bus":"pci.2"}' \ -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","bus":"pci.21","addr":"0x0"}' \ -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","bus":"pci.22","addr":"0x0"}' \
The iommufd integration in the XML looks a bit wierd too - we have four different elements all referencing 'iommufd0' but nothing is defining this. The iommu references the iommufd0, but nothing actually uses this on the arm-smuv3-accel command line.
I've not been paying much attention to iommufd in QEMU, but IIUC it will apply to x86_64 too. So I'm wondering how iommufd integration sound work in libvirt more broadly.
It is my understanding that we want to consider device classes for libvirt device representation in XML, so I intended to have users declare the iommufd definition as an attribute under the <iommu> stanza, which would> translate to the following qemu argument: -object '{"qom-type":"iommufd","id":"iommufd0"}'
but since this series implements support for multiple <iommu> definitions, we specify iommufd0 for multiple <iommu> stanzas. For x86_64, we would just specify the iommufd attribute once under a single <iommu> stanza.
Would you suggest we move iommufd out of the <iommu> definition instead, like the examples below? AFAICT iommufd isn't connected to the guest iommu at all in terms of configuration, it is simply an attribute of the hostdev. eg we could do
<hostdev mode='subsystem' type='mdev' model='vfio-pci' iommufd='on'>
that does leave open the possibility that someone configures iommufd on one hostdev, but not on another, but that's not as bad as when we set it on the <iommu> too. So something we can validate in post-parse logic if we need to ensure consistent usage - if qemu allows a mix of iommfd and non-iommufd for vfio-pci, we can just allow that at libvirt too
Ok sounds good, I will make this only a hostdev attribute and generate the qemu iommufd object argument when detected instead of having users specify it under the guest iommu definition. We should be able to allow a mix of iommufd and non-iommufd for vfio-pci, and we can avoid the case where someone specifies it for the hostdev but not in the <iommu> stanza like you mentioned. Thanks, Nathan
participants (3)
-
Daniel P. Berrangé
-
Nathan Chen
-
Shameerali Kolothum Thodi