[PATCH 0/4] cover letter: qemu: Implement support for iommufd
Hi,

This series implements support for using iommufd to propagate DMA
mappings to the kernel for VM-assigned host devices in a qemu VM.

We add a new 'iommufd' attribute for hostdev devices to be associated
with the iommufd object.

For instance, specifying the iommufd object and associated hostdev in a
VM definition:

  <devices>
    ...
    <hostdev mode='subsystem' type='pci' managed='no'>
      <driver iommufd='yes'/>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <driver iommufd='yes'/>
      <source>
        <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/>
    </hostdev>
    ...
  </devices>

This would get translated to a qemu command line with the arguments
below. Note that libvirt will open the /dev/iommu and VFIO cdev,
passing the associated fd number to qemu:

  -object '{"qom-type":"iommufd","id":"iommufd0","fd":"24"}' \
  -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","fd":"22","bus":"pci.21","addr":"0x0"}' \
  -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","fd":"25","bus":"pci.22","addr":"0x0"}' \

This series is on Github:
https://github.com/NathanChenNVIDIA/libvirt/tree/iommufd-10-23-25

Thanks,
Nathan

Signed-off-by: Nathan Chen <nathanc@nvidia.com>

Nathan Chen (4):
  qemu: Implement support for associating iommufd to hostdev
  qemu: open iommufd FDs from libvirt backend
  qemu: Update Cgroup, namespace, and seclabel for qemu to access iommufd paths
  tests: qemuxmlconfdata: provide iommufd sample XML and CLI args

 docs/formatdomain.rst                         |   8 +
 src/conf/device_conf.c                        |   9 +
 src/conf/device_conf.h                        |   1 +
 src/conf/schemas/basictypes.rng               |   5 +
 src/qemu/qemu_cgroup.c                        |  61 +++++
 src/qemu/qemu_cgroup.h                        |   1 +
 src/qemu/qemu_command.c                       |  62 ++++-
 src/qemu/qemu_command.h                       |   3 +-
 src/qemu/qemu_domain.c                        |   8 +
 src/qemu/qemu_domain.h                        |   7 +
 src/qemu/qemu_hotplug.c                       |   2 +-
 src/qemu/qemu_namespace.c                     |  44 ++++
 src/qemu/qemu_process.c                       | 232 ++++++++++++++++++
 src/security/security_apparmor.c              |  15 ++
 src/security/security_dac.c                   |  34 +++
 src/security/security_selinux.c               |  34 +++
 src/security/virt-aa-helper.c                 |  11 +-
 src/util/virpci.c                             |  68 +++++
 src/util/virpci.h                             |   1 +
 .../iommufd-q35.x86_64-latest.args            |  41 ++++
 .../iommufd-q35.x86_64-latest.xml             |  60 +++++
 tests/qemuxmlconfdata/iommufd-q35.xml         |  38 +++
 .../iommufd-virt.aarch64-latest.args          |  33 +++
 .../iommufd-virt.aarch64-latest.xml           |  34 +++
 tests/qemuxmlconfdata/iommufd-virt.xml        |  22 ++
 .../iommufd.x86_64-latest.args                |  35 +++
 .../qemuxmlconfdata/iommufd.x86_64-latest.xml |  38 +++
 tests/qemuxmlconfdata/iommufd.xml             |  30 +++
 tests/qemuxmlconftest.c                       |   4 +
 29 files changed, 934 insertions(+), 7 deletions(-)
 create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args
 create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml
 create mode 100644 tests/qemuxmlconfdata/iommufd-q35.xml
 create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args
 create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml
 create mode 100644 tests/qemuxmlconfdata/iommufd-virt.xml
 create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.args
 create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.xml
 create mode 100644 tests/qemuxmlconfdata/iommufd.xml

--
2.43.0
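As an illustration of the fd hand-off described in the cover letter, here is a minimal standalone sketch of how an already-open /dev/iommu descriptor number could be turned into the JSON argument for -object. This is not the code from the series (which builds the props through libvirt's JSON helpers); format_iommufd_object() is a hypothetical name, and the sketch assumes the object id contains no characters that need JSON escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not libvirt API: format the JSON payload for
 * "-object" from an object id and the number of an fd that the caller
 * has already opened on /dev/iommu and will pass to the child process.
 *
 * Assumption: @id needs no JSON escaping (e.g. "iommufd0").
 * Returns 0 on success, -1 if the output did not fit in @buf.
 */
int
format_iommufd_object(char *buf, size_t buflen, const char *id, int fd)
{
    int n;

    assert(buf != NULL && id != NULL && strlen(id) > 0);

    /* The fd is passed as a JSON string, matching the cover letter. */
    n = snprintf(buf, buflen,
                 "{\"qom-type\":\"iommufd\",\"id\":\"%s\",\"fd\":\"%d\"}",
                 id, fd);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Calling format_iommufd_object(buf, sizeof(buf), "iommufd0", 24) with a sufficiently large buffer produces the same -object payload shown in the cover letter. The fd number is only meaningful if the descriptor is inherited by qemu under that same number.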
Implement a new iommufd attribute under hostdevs' PCI subsystem driver that can be used to specify associated iommufd object when launching a qemu VM. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 8 ++++++++ src/conf/device_conf.c | 9 +++++++++ src/conf/device_conf.h | 1 + src/conf/schemas/basictypes.rng | 5 +++++ src/qemu/qemu_command.c | 19 +++++++++++++++++++ 5 files changed, 42 insertions(+) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 34dc9c3af7..a5c69dbcf4 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4845,6 +4845,7 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. + ``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on @@ -4885,6 +4886,13 @@ or: found is "problematic" in some way, the generic vfio-pci driver similarly be forced. + The ``<driver>`` element's ``iommufd`` attribute is used to specify + using the iommufd interface to propagate DMA mappings to the kernel, + instead of legacy VFIO. When the attribute is present, an iommufd + object will be created by the resulting qemu command. Libvirt will + open the /dev/iommu and VFIO device cdev, passing the associated + file descriptor numbers to the qemu command. 
+ (Note: :since:`Since 1.0.5`, the ``name`` attribute has been described to be used to select the type of PCI device assignment ("vfio", "kvm", or "xen"), but those values have been mostly diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index c278b81652..88979ecc39 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -60,6 +60,8 @@ int virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, virDeviceHostdevPCIDriverInfo *driver) { + virTristateBool iommufd; + driver->iommufd = false; if (virXMLPropEnum(node, "name", virDeviceHostdevPCIDriverNameTypeFromString, VIR_XML_PROP_NONZERO, @@ -67,6 +69,10 @@ virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, return -1; } + if (virXMLPropTristateBool(node, "iommufd", VIR_XML_PROP_NONE, &iommufd) < 0) + return -1; + virTristateBoolToBool(iommufd, &driver->iommufd); + driver->model = virXMLPropString(node, "model"); return 0; } @@ -93,6 +99,9 @@ virDeviceHostdevPCIDriverInfoFormat(virBuffer *buf, virBufferEscapeString(&driverAttrBuf, " model='%s'", driver->model); + if (driver->iommufd) + virBufferAddLit(&driverAttrBuf, " iommufd='yes'"); + virXMLFormatElement(buf, "driver", &driverAttrBuf, NULL); return 0; } diff --git a/src/conf/device_conf.h b/src/conf/device_conf.h index e570f51824..7bdbd80b0a 100644 --- a/src/conf/device_conf.h +++ b/src/conf/device_conf.h @@ -47,6 +47,7 @@ VIR_ENUM_DECL(virDeviceHostdevPCIDriverName); struct _virDeviceHostdevPCIDriverInfo { virDeviceHostdevPCIDriverName name; char *model; + bool iommufd; }; typedef enum { diff --git a/src/conf/schemas/basictypes.rng b/src/conf/schemas/basictypes.rng index 2931e316b7..089fc0f1c2 100644 --- a/src/conf/schemas/basictypes.rng +++ b/src/conf/schemas/basictypes.rng @@ -673,6 +673,11 @@ <ref name="genericName"/> </attribute> </optional> + <optional> + <attribute name="iommufd"> + <ref name="virYesNo"/> + </attribute> + </optional> <empty/> </element> </define> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 
c538a9fb2f..8fd7527645 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4738,6 +4738,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, g_autofree char *host = virPCIDeviceAddressAsString(&pcisrc->addr); const char *failover_pair_id = NULL; const char *driver = NULL; + const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; @@ -4771,6 +4772,9 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, teaming->persistent) failover_pair_id = teaming->persistent; + if (pcisrc->driver.iommufd) + iommufdId = "iommufd0"; + if (virJSONValueObjectAdd(&props, "s:driver", driver, "s:host", host, @@ -4779,6 +4783,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, "S:failover_pair_id", failover_pair_id, "S:display", qemuOnOffAuto(pcisrc->display), "B:ramfb", ramfb, + "S:iommufd", iommufdId, NULL) < 0) return NULL; @@ -5195,6 +5200,9 @@ qemuBuildHostdevCommandLine(virCommand *cmd, virQEMUCaps *qemuCaps) { size_t i; + g_autoptr(virJSONValue) props = NULL; + int iommufd = 0; + const char * iommufdId = "iommufd0"; for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5223,6 +5231,17 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (hostdev->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) continue; + if (subsys->u.pci.driver.iommufd && iommufd == 0) { + iommufd = 1; + if (qemuMonitorCreateObjectProps(&props, "iommufd", + iommufdId, + NULL) < 0) + return -1; + + if (qemuBuildObjectCommandlineFromJSON(cmd, props) < 0) + return -1; + } + if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1; -- 2.43.0
[cc-ing Laine and Andrea if they have a better memory of the time we
went from "legacy" passthrough to vfio]

On a Monday in 2025, Nathan Chen via Devel wrote:
Implement a new iommufd attribute under hostdevs' PCI subsystem driver that can be used to specify associated iommufd object when launching a qemu VM.
This does not specify which iommufd object it is, just to use the
default one.

It's perfect for now, we might need a different element if using
anything else than iommufd0 starts making sense.

Also, I think it should be fine not to expose the object in the XML
since it has configurable attributes now:

# qemu-system-x86_64 -object iommufd,?
iommufd options:
  fd=<string>
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 8 ++++++++ src/conf/device_conf.c | 9 +++++++++ src/conf/device_conf.h | 1 + src/conf/schemas/basictypes.rng | 5 +++++ src/qemu/qemu_command.c | 19 +++++++++++++++++++ 5 files changed, 42 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 34dc9c3af7..a5c69dbcf4 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4845,6 +4845,7 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. + ``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on @@ -4885,6 +4886,13 @@ or: found is "problematic" in some way, the generic vfio-pci driver similarly be forced.
+ The ``<driver>`` element's ``iommufd`` attribute is used to specify + using the iommufd interface to propagate DMA mappings to the kernel, + instead of legacy VFIO. When the attribute is present, an iommufd + object will be created by the resulting qemu command. Libvirt will + open the /dev/iommu and VFIO device cdev, passing the associated + file descriptor numbers to the qemu command. +
Should we resurrect the old attribute and use:

  <driver name="iommufd"/>

The idea being that later in time, when it will no longer make sense to
use "legacy" VFIO, we will retire it again.

Also, referring to it as "legacy" is both premature (since iommufd does
not have feature parity yet) and confusing in the passage of time.
(Note: :since:`Since 1.0.5`, the ``name`` attribute has been described to be used to select the type of PCI device assignment ("vfio", "kvm", or "xen"), but those values have been mostly diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index c278b81652..88979ecc39 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -60,6 +60,8 @@ int virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, virDeviceHostdevPCIDriverInfo *driver) { + virTristateBool iommufd; + driver->iommufd = false; if (virXMLPropEnum(node, "name", virDeviceHostdevPCIDriverNameTypeFromString, VIR_XML_PROP_NONZERO, @@ -67,6 +69,10 @@ virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, return -1; }
+ if (virXMLPropTristateBool(node, "iommufd", VIR_XML_PROP_NONE, &iommufd) < 0) + return -1; + virTristateBoolToBool(iommufd, &driver->iommufd);
Storing this as 'bool' is losing information. We need to be able to
tell whether iommufd was not used because the user did not specify it
or whether it was not used because the user explicitly said no for
future compatibility reasons.

Jano
+ driver->model = virXMLPropString(node, "model"); return 0; } @@ -93,6 +99,9 @@ virDeviceHostdevPCIDriverInfoFormat(virBuffer *buf,
virBufferEscapeString(&driverAttrBuf, " model='%s'", driver->model);
+ if (driver->iommufd) + virBufferAddLit(&driverAttrBuf, " iommufd='yes'"); + virXMLFormatElement(buf, "driver", &driverAttrBuf, NULL); return 0; }
On 11/6/2025 10:49 AM, Ján Tomko wrote:
Implement a new iommufd attribute under hostdevs' PCI subsystem driver that can be used to specify associated iommufd object when launching a qemu VM.
This does not specify which iommufd object it is, just to use the default one.
It's perfect for now, we might need a different element if using anything else than iommufd0 starts making sense.
Also, I think it should be fine not to expose the object in the XML
since it has configurable attributes now:

# qemu-system-x86_64 -object iommufd,?
iommufd options:
  fd=<string>
Noted, will re-visit if anything else other than iommufd0 makes sense.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 8 ++++++++ src/conf/device_conf.c | 9 +++++++++ src/conf/device_conf.h | 1 + src/conf/schemas/basictypes.rng | 5 +++++ src/qemu/qemu_command.c | 19 +++++++++++++++++++ 5 files changed, 42 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 34dc9c3af7..a5c69dbcf4 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4845,6 +4845,7 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. + ``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on @@ -4885,6 +4886,13 @@ or: found is "problematic" in some way, the generic vfio-pci driver similarly be forced.
+ The ``<driver>`` element's ``iommufd`` attribute is used to specify + using the iommufd interface to propagate DMA mappings to the kernel, + instead of legacy VFIO. When the attribute is present, an iommufd + object will be created by the resulting qemu command. Libvirt will + open the /dev/iommu and VFIO device cdev, passing the associated + file descriptor numbers to the qemu command. +
Should we resurrect the old attribute and use: <driver name="iommufd"/>
The idea being that later in time, when it will no longer make sense to use "legacy" VFIO, we will retire it again.
Also, referring to it as "legacy" is both premature (since iommufd does not have the feature parity yet) and confusing in the passage of time.
I think it would be better to leave it as-is for now, since there are variant VFIO drivers besides vfio-pci that could be assigned to the driver name attribute in tandem with enabling iommufd.
(Note: :since:`Since 1.0.5`, the ``name`` attribute has been described to be used to select the type of PCI device assignment ("vfio", "kvm", or "xen"), but those values have been mostly diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index c278b81652..88979ecc39 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -60,6 +60,8 @@ int virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, virDeviceHostdevPCIDriverInfo *driver) { + virTristateBool iommufd; + driver->iommufd = false; if (virXMLPropEnum(node, "name", virDeviceHostdevPCIDriverNameTypeFromString, VIR_XML_PROP_NONZERO, @@ -67,6 +69,10 @@ virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, return -1; }
+ if (virXMLPropTristateBool(node, "iommufd", VIR_XML_PROP_NONE, &iommufd) < 0) + return -1; + virTristateBoolToBool(iommufd, &driver->iommufd);
Storing this as 'bool' is losing information. We need to be able to tell whether iommufd was not used because the user did not specify it or whether it was not used because the user explicitly said no for future compatibility reasons.
That makes sense, I will update it to use virTristateBool instead in
the next revision.

-Nathan
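To make the review point about lost information concrete, here is a small standalone sketch. The Tristate enum is a local stand-in that mirrors the absent/yes/no shape of libvirt's virTristateBool; the function names are hypothetical and none of this is code from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for a tristate boolean (absent / yes / no). */
typedef enum {
    TRISTATE_ABSENT = 0,   /* attribute not present in the XML */
    TRISTATE_YES,          /* iommufd='yes' */
    TRISTATE_NO,           /* iommufd='no' */
} Tristate;

/* Lossy conversion: both ABSENT and NO collapse to false. */
bool
tristate_to_bool(Tristate t)
{
    return t == TRISTATE_YES;
}

/* Formatting from a bool can only ever emit 'yes' or nothing. */
const char *
format_from_bool(bool b)
{
    return b ? " iommufd='yes'" : NULL;
}

/* Formatting from the tristate can round-trip an explicit 'no'. */
const char *
format_from_tristate(Tristate t)
{
    switch (t) {
    case TRISTATE_YES:
        return " iommufd='yes'";
    case TRISTATE_NO:
        return " iommufd='no'";
    case TRISTATE_ABSENT:
        break;
    }
    return NULL;   /* omit the attribute entirely */
}

/* Self-check showing where the explicit 'no' disappears. */
void
demo_information_loss(void)
{
    /* Stored as bool, an explicit 'no' formats like an absent attribute. */
    assert(format_from_bool(tristate_to_bool(TRISTATE_NO)) == NULL);
    assert(format_from_bool(tristate_to_bool(TRISTATE_ABSENT)) == NULL);

    /* Stored as tristate, the two cases stay distinguishable. */
    assert(format_from_tristate(TRISTATE_NO) != NULL);
    assert(format_from_tristate(TRISTATE_ABSENT) == NULL);
}
```

With the bool representation, a domain defined with iommufd='no' would be formatted back without the attribute, so a later change of the default could silently alter its behaviour. Keeping the tristate preserves the user's explicit choice.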
TL;DR of all my rambling below - I think just adding "iommufd='yes'"
attribute (as a tristate like Jano suggested) is fine for now, and don't
think we should do anything with the name attribute.

More details below if you're really in for a read :-)

On 11/6/25 7:29 PM, Nathan Chen via Devel wrote:
On 11/6/2025 10:49 AM, Ján Tomko wrote:
Implement a new iommufd attribute under hostdevs' PCI subsystem driver that can be used to specify associated iommufd object when launching a qemu VM.
This does not specify which iommufd object it is, just to use the default one.
It's perfect for now, we might need a different element if using anything else than iommufd0 starts making sense.
Yeah, I think earlier versions of the patches explicitly gave the
iommufd object name used by each device (e.g. literally "iommufd0"), and
we deemed that "too much information", recommending to instead just say
"use it" or "don't use it", and then later we can add an iommufdIndex or
something that would default to 0, and then could contain other values
if multiple iommufd objects were needed (and so, e.g., if two devices
had "iommufd='yes' iommufdIndex='1'" then they would both be set up to
use the same (non-default) iommufd (maybe "iommufd1")).

So for right now while we're just supporting a single iommufd object per
domain, the current proposed XML should be fine.
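The possible future extension described above could be sketched roughly as follows. To be clear, "iommufdIndex" is only an idea floated in this thread, and iommufd_id_from_index() is a hypothetical helper; neither exists in libvirt. The sketch only shows how an index that defaults to 0 would map devices sharing an index onto the same object id.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: derive the iommufd object id from a per-device
 * index. Index 0 yields "iommufd0", the id the series hardcodes today.
 *
 * Returns 0 on success, -1 if the id did not fit in @buf.
 */
int
iommufd_id_from_index(char *buf, size_t buflen, unsigned int index)
{
    int n;

    assert(buf != NULL);

    n = snprintf(buf, buflen, "iommufd%u", index);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Two devices that both carried an index of 1 would resolve to the same id, "iommufd1", and so would share one object, while devices without the attribute would keep using "iommufd0".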
Also, I think it should be fine not to expose the object in the XML since it has configurable attributes now:
I think you mean "*no* configurable attributes"?
# qemu-system-x86_64 -object iommufd,?
iommufd options:
  fd=<string>
Noted, will re-visit if anything else other than iommufd0 makes sense.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 8 ++++++++ src/conf/device_conf.c | 9 +++++++++ src/conf/device_conf.h | 1 + src/conf/schemas/basictypes.rng | 5 +++++ src/qemu/qemu_command.c | 19 +++++++++++++++++++ 5 files changed, 42 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 34dc9c3af7..a5c69dbcf4 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4845,6 +4845,7 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. + ``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on @@ -4885,6 +4886,13 @@ or: found is "problematic" in some way, the generic vfio-pci driver similarly be forced.
+ The ``<driver>`` element's ``iommufd`` attribute is used to specify + using the iommufd interface to propagate DMA mappings to the kernel, + instead of legacy VFIO. When the attribute is present, an iommufd + object will be created by the resulting qemu command. Libvirt will + open the /dev/iommu and VFIO device cdev, passing the associated + file descriptor numbers to the qemu command. +
Should we resurrect the old attribute and use: <driver name="iommufd"/>
The idea being that later in time, when it will no longer make sense to use "legacy" VFIO, we will retire it again.
My understanding is that this is still classified as "VFIO device
assignment", but just using an iommufd for communication, so it's not
"let's do this instead of VFIO", but "let's do VFIO *this* way instead
of the other way".

Meanwhile, you'd asked earlier about memories of the switch from "legacy
KVM" device assignment to VFIO. One thing that's really important to
know about that change (when thinking about it as a model to follow for
this current change) is that during those days the presence of any PCI
device assigned from the host to a guest would render the domain
unmigratable, and so when thinking about the transition of the default
from one to the other we didn't need to consider the possibility of
migrating a running guest from legacy KVM to VFIO - in order to switch
from one to the other you had to shut down and then restart the domain.

We also kept around the "<driver name='kvm'/>" nearly a decade longer
than was likely necessary - we only added the ability to manually
select at all because someone "closer to customers/users" had insisted
we needed a way to switch back to the "old way" if there was a bug in
VFIO. But this was never needed - from the very beginning VFIO worked
better than legacy KVM, and there were no "missing" features that would
require someone to use legacy KVM assignment.

I recall removing at least part of the supporting code for legacy KVM
assignment several years ago (pretty sure someone else removed the final
vestiges) and at the time thinking to myself "all this work that made
the code and the configuration more complicated, and made maintenance
more complicated and time consuming, only to *never* use it, and then
finally remove it 10 years later. How depressing :-/".
Also, referring to it as "legacy" is both premature (since iommufd does not have the feature parity yet) and confusing in the passage of time.
I think it would be better to leave it as-is for now, since there are
variant VFIO drivers besides vfio-pci that could be assigned to the
driver name attribute in tandem with enabling iommufd.

Actually a vfio variant driver (other than the variant driver that is
automatically discovered as "most appropriate" for the device) is
configured with <driver model='blah'/>, not name='blah'.
(Note: :since:`Since 1.0.5`, the ``name`` attribute has been described to be used to select the type of PCI device assignment ("vfio", "kvm", or "xen"), but those values have been mostly diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index c278b81652..88979ecc39 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -60,6 +60,8 @@ int virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, virDeviceHostdevPCIDriverInfo *driver) { + virTristateBool iommufd; + driver->iommufd = false; if (virXMLPropEnum(node, "name", virDeviceHostdevPCIDriverNameTypeFromString, VIR_XML_PROP_NONZERO, @@ -67,6 +69,10 @@ virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, return -1; }
+ if (virXMLPropTristateBool(node, "iommufd", VIR_XML_PROP_NONE, &iommufd) < 0) + return -1; + virTristateBoolToBool(iommufd, &driver->iommufd);
Storing this as 'bool' is losing information. We need to be able to tell whether iommufd was not used because the user did not specify it or whether it was not used because the user explicitly said no for future compatibility reasons.
+1
That makes sense, I will update it to use virTristateBool instead in the next revision.
-Nathan
On Fri, Nov 21, 2025 at 10:30:45AM -0500, Laine Stump wrote:
On 11/6/2025 10:49 AM, Ján Tomko wrote:
This does not specify which iommufd object it is, just to use the default one.
It's perfect for now, we might need a different element if using anything else than iommufd0 starts making sense.
Yeah, I think earlier versions of the patches explicitly gave the
iommufd object name used by each device (e.g. literally "iommufd0"), and
we deemed that "too much information", recommending to instead just say
"use it" or "don't use it", and then later we can add an iommufdIndex or
something that would default to 0, and then could contain other values
if multiple iommufd objects were needed (and so, e.g., if two devices
had "iommufd='yes' iommufdIndex='1'" then they would both be set up to
use the same (non-default) iommufd (maybe "iommufd1")).
So for right now while we're just supporting a single iommufd object per domain, the current proposed XML should be fine.
Link to the relevant bit from that previous conversation: [1]. I agree that we can just extend the schema with an additional attribute if and when multiple iommufds become something that we actually want.
Should we resurrect the old attribute and use: <driver name="iommufd"/>
The idea being that later in time, when it will no longer make sense to use "legacy" VFIO, we will retire it again.
My understanding is that this is still classified as "VFIO device assignment", but just using an iommufd for communication, so it's not "let's do this instead of VFIO", but "let's do VFIO *this* way instead of the other way".
Quoting Alex Williamson[2]:

  [...] while initially IOMMUFD is only used by vfio, the intention is
  that it becomes the default userspace IOMMU interface for not only
  vfio, but also vdpa and similar technologies.

So I agree with your take that a new attribute is more appropriate.
Further down the line, we can add the same attribute to other devices
that can take advantage of iommufd.

Overall, the current implementation seems fine to me, modulo of course
the issues that have already been pointed out.

[1] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/QRYZ...
[2] https://issues.redhat.com/browse/RHEL-36153?focusedId=24825981&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24825981

--
Andrea Bolognani / Red Hat / Virtualization
Open iommufd FDs from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu and one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 43 +++++++- src/qemu/qemu_command.h | 3 +- src/qemu/qemu_domain.c | 8 ++ src/qemu/qemu_domain.h | 7 ++ src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 232 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 289 insertions(+), 6 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 8fd7527645..740a6970f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4730,7 +4730,8 @@ qemuBuildVideoCommandLine(virCommand *cmd, virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev) + virDomainHostdevDef *dev, + virDomainObj *vm) { g_autoptr(virJSONValue) props = NULL; virDomainHostdevSubsysPCI *pcisrc = &dev->source.subsys.u.pci; @@ -4741,6 +4742,13 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + qemuDomainObjPrivate *priv = vm ? 
vm->privateData : NULL; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + pcisrc->driver.iommufd) { + useIommufd = true; + } /* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4787,6 +4795,18 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL; + if (useIommufd && priv) { + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); + + int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + if (virJSONValueObjectAdd(&props, + "S:fd", g_strdup_printf("%d", vfiofd), + NULL) < 0) + return NULL; + } + if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0) return NULL; @@ -5197,12 +5217,14 @@ qemuBuildAcpiNodesetProps(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; const char * iommufdId = "iommufd0"; + qemuDomainObjPrivate *priv = vm->privateData; for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5233,8 +5255,10 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (subsys->u.pci.driver.iommufd && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); if (qemuMonitorCreateObjectProps(&props, "iommufd", iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd), NULL) < 0) return -1; @@ -5245,7 +5269,18 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1; - if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev))) + if (subsys->u.pci.driver.iommufd) { + virDomainHostdevSubsysPCI *pcisrc = &hostdev->source.subsys.u.pci; + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + 
pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); + + int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + + virCommandPassFD(cmd, vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); + } + + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev, vm))) return -1; if (qemuBuildDeviceCommandlineFromJSON(cmd, devprops, def, qemuCaps) < 0) @@ -10893,7 +10928,7 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildRedirdevCommandLine(cmd, def, qemuCaps) < 0) return NULL; - if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) + if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps, vm) < 0) return NULL; if (migrateURI) diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h index ad068f1f16..380aac261f 100644 --- a/src/qemu/qemu_command.h +++ b/src/qemu/qemu_command.h @@ -180,7 +180,8 @@ qemuBuildThreadContextProps(virJSONValue **tcProps, /* Current, best practice */ virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev); + virDomainHostdevDef *dev, + virDomainObj *vm); virJSONValue * qemuBuildRNGDevProps(const virDomainDef *def, diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index a42721efad..86640aa3e3 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1953,6 +1953,11 @@ qemuDomainObjPrivateFree(void *data) virChrdevFree(priv->devs); + if (priv->iommufd >= 0) { + virEventRemoveHandle(priv->iommufd); + priv->iommufd = -1; + } + if (priv->pidMonitored >= 0) { virEventRemoveHandle(priv->pidMonitored); priv->pidMonitored = -1; @@ -1974,6 +1979,7 @@ qemuDomainObjPrivateFree(void *data) g_clear_pointer(&priv->blockjobs, g_hash_table_unref); g_clear_pointer(&priv->fds, g_hash_table_unref); + g_clear_pointer(&priv->vfioDeviceFds, g_hash_table_unref); /* This should never be non-NULL if we get here, but just in case... 
*/ if (priv->eventThread) { @@ -2002,7 +2008,9 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->blockjobs = virHashNew(virObjectUnref); priv->fds = virHashNew(g_object_unref); + priv->vfioDeviceFds = g_hash_table_new(g_str_hash, g_str_equal); + priv->iommufd = -1; priv->pidMonitored = -1; /* agent commands block by default, user can choose different behavior */ diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 3396f929fd..d6214df783 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -264,6 +264,10 @@ struct _qemuDomainObjPrivate { /* named file descriptor groups associated with the VM */ GHashTable *fds; + int iommufd; + + GHashTable *vfioDeviceFds; + char *memoryBackingDir; }; @@ -1174,3 +1178,6 @@ qemuDomainCheckCPU(virArch arch, bool qemuDomainMachineSupportsFloppy(const char *machine, virQEMUCaps *qemuCaps); + +int qemuProcessOpenVfioFds(virDomainObj *vm); +void qemuProcessCloseVfioFds(virDomainObj *vm); diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index fb426deb1a..661e9008f7 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -1630,7 +1630,7 @@ qemuDomainAttachHostPCIDevice(virQEMUDriver *driver, goto error; } - if (!(devprops = qemuBuildPCIHostdevDevProps(vm->def, hostdev))) + if (!(devprops = qemuBuildPCIHostdevDevProps(vm->def, hostdev, vm))) goto error; qemuDomainObjEnterMonitor(vm); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 45fc32a663..cecfed94a7 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -25,6 +25,7 @@ #include <unistd.h> #include <signal.h> #include <sys/stat.h> +#include <dirent.h> #if WITH_SYS_SYSCALL_H # include <sys/syscall.h> #endif @@ -8091,6 +8092,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuExtDevicesStart(driver, vm, incomingMigrationExtDevices) < 0) goto cleanup; + if (qemuProcessOpenVfioFds(vm) < 0) + goto cleanup; + if (!(cmd = qemuBuildCommandLine(vm, incoming ? 
"defer" : NULL, vmop, @@ -10267,3 +10271,231 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, qemuProcessEventSubmit(vm, QEMU_PROCESS_EVENT_NBDKIT_EXITED, 0, 0, nbdkit); virObjectUnlock(vm); } + +/** + * qemuProcessOpenIommuFd: + * @vm: domain object + * @iommuFd: returned file descriptor + * + * Opens /dev/iommu file descriptor for the VM. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenIommuFd(virDomainObj *vm, int *iommuFd) +{ + int fd = -1; + + VIR_DEBUG("Opening IOMMU FD for domain %s", vm->def->name); + + if ((fd = open("/dev/iommu", O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU FD support requires /dev/iommu device")); + } else { + virReportSystemError(errno, "%s", + _("cannot open /dev/iommu")); + } + return -1; + } + + *iommuFd = fd; + VIR_DEBUG("Opened IOMMU FD %d for domain %s", fd, vm->def->name); + return 0; +} + +/** + * qemuProcessGetVfioDevicePath: + * @hostdev: host device definition + * @vfioPath: returned VFIO device path + * + * Constructs the VFIO device path for a PCI hostdev. 
+ * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessGetVfioDevicePath(virDomainHostdevDef *hostdev, + char **vfioPath) +{ + virPCIDeviceAddress *addr; + g_autofree char *sysfsPath = NULL; + DIR *dir = NULL; + struct dirent *entry = NULL; + int ret = -1; + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || + hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("VFIO FD only supported for PCI hostdevs")); + return -1; + } + + addr = &hostdev->source.subsys.u.pci.addr; + + /* Build sysfs path: /sys/bus/pci/devices/DDDD:BB:DD.F/vfio-dev/ */ + sysfsPath = g_strdup_printf("/sys/bus/pci/devices/" + "%04x:%02x:%02x.%d/vfio-dev/", + addr->domain, addr->bus, + addr->slot, addr->function); + + if (virDirOpen(&dir, sysfsPath) < 0) { + virReportSystemError(errno, + _("cannot open VFIO sysfs directory %1$s"), + sysfsPath); + return -1; + } + + /* Find the vfio device name in the directory */ + while (virDirRead(dir, &entry, sysfsPath) > 0) { + if (STRPREFIX(entry->d_name, "vfio")) { + *vfioPath = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name); + ret = 0; + break; + } + } + + if (ret < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("cannot find VFIO device for PCI device %1$04x:%2$02x:%3$02x.%4$d"), + addr->domain, addr->bus, addr->slot, addr->function); + } + + virDirClose(dir); + return ret; +} + +/** + * qemuProcessOpenVfioDeviceFd: + * @hostdev: host device definition + * @vfioFd: returned file descriptor + * + * Opens the VFIO device file descriptor for a hostdev. 
+ * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev, + int *vfioFd) +{ + g_autofree char *vfioPath = NULL; + int fd = -1; + + if (qemuProcessGetVfioDevicePath(hostdev, &vfioPath) < 0) + return -1; + + VIR_DEBUG("Opening VFIO device %s", vfioPath); + + if ((fd = open(vfioPath, O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("VFIO device %1$s not found - ensure device is bound to vfio-pci driver"), + vfioPath); + } else { + virReportSystemError(errno, + _("cannot open VFIO device %1$s"), vfioPath); + } + return -1; + } + + *vfioFd = fd; + VIR_DEBUG("Opened VFIO device FD %d for %s", *vfioFd, vfioPath); + return 0; +} + +/** + * qemuProcessOpenVfioFds: + * @vm: domain object + * + * Opens all necessary VFIO file descriptors for the domain. + * + * Returns: 0 on success, -1 on failure + */ +int +qemuProcessOpenVfioFds(virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + bool needsIommuFd = false; + size_t i; + + /* Check if we have any hostdevs that need VFIO FDs */ + for (i = 0; i < vm->def->nhostdevs; i++) { + virDomainHostdevDef *hostdev = vm->def->hostdevs[i]; + int vfioFd = -1; + g_autofree char *fdname = NULL; + + if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + + /* Check if this hostdev uses VFIO with IOMMU FD */ + if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + hostdev->source.subsys.u.pci.driver.iommufd) { + + needsIommuFd = true; + + /* Open VFIO device FD */ + if (qemuProcessOpenVfioDeviceFd(hostdev, &vfioFd) < 0) + goto error; + + /* Store the FD */ + fdname = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + hostdev->source.subsys.u.pci.addr.domain, + hostdev->source.subsys.u.pci.addr.bus, + hostdev->source.subsys.u.pci.addr.slot, + hostdev->source.subsys.u.pci.addr.function); + + 
g_hash_table_insert(priv->vfioDeviceFds, g_steal_pointer(&fdname), GINT_TO_POINTER(vfioFd)); + + VIR_DEBUG("Stored VFIO FD for device %s", fdname); + } + } + } + + /* Open IOMMU FD if needed */ + if (needsIommuFd) { + int iommuFd = -1; + + if (qemuProcessOpenIommuFd(vm, &iommuFd) < 0) + goto error; + + priv->iommufd = iommuFd; + + VIR_DEBUG("Stored IOMMU FD"); + } + + return 0; + + error: + qemuProcessCloseVfioFds(vm); + return -1; +} + +/** + * qemuProcessCloseVfioFds: + * @vm: domain object + * + * Closes all VFIO file descriptors for the domain. + */ +void +qemuProcessCloseVfioFds(virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + GHashTableIter iter; + gpointer key, value; + + /* Close all VFIO device FDs */ + if (priv->vfioDeviceFds) { + g_hash_table_iter_init(&iter, priv->vfioDeviceFds); + while (g_hash_table_iter_next(&iter, &key, &value)) { + int fd = GPOINTER_TO_INT(value); + VIR_DEBUG("Closing VFIO device FD %d for %s", fd, (char*)key); + VIR_FORCE_CLOSE(fd); + } + g_hash_table_remove_all(priv->vfioDeviceFds); + } + + /* Close IOMMU FD */ + if (priv->iommufd >= 0) { + VIR_DEBUG("Closing IOMMU FD %d", priv->iommufd); + VIR_FORCE_CLOSE(priv->iommufd); + } +} -- 2.43.0
On a Monday in 2025, Nathan Chen via Devel wrote:
Open iommufd FDs from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu and one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line.
The part formatting the object and the part formatting the device should be split.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 43 +++++++- src/qemu/qemu_command.h | 3 +- src/qemu/qemu_domain.c | 8 ++ src/qemu/qemu_domain.h | 7 ++ src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 232 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 289 insertions(+), 6 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 8fd7527645..740a6970f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4730,7 +4730,8 @@ qemuBuildVideoCommandLine(virCommand *cmd,
virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev) + virDomainHostdevDef *dev, + virDomainObj *vm)
Hmm, perhaps exposing the iommufd object in the XML would save us from having to pass this.
{ g_autoptr(virJSONValue) props = NULL; virDomainHostdevSubsysPCI *pcisrc = &dev->source.subsys.u.pci; @@ -4741,6 +4742,13 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + qemuDomainObjPrivate *priv = vm ? vm->privateData : NULL; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + pcisrc->driver.iommufd) { + useIommufd = true; + }
/* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4787,6 +4795,18 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL;
+ if (useIommufd && priv) { + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); +
There's no need to duplicate the list of hostdevs which use iommufd in a per-domain hash table. For storing per-device file descriptors, we have per-device private data.
+ int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + if (virJSONValueObjectAdd(&props, + "S:fd", g_strdup_printf("%d", vfiofd), + NULL) < 0) + return NULL; + } + if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0) return NULL;
@@ -5197,12 +5217,14 @@ qemuBuildAcpiNodesetProps(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; const char * iommufdId = "iommufd0"; + qemuDomainObjPrivate *priv = vm->privateData;
for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5233,8 +5255,10 @@ qemuBuildHostdevCommandLine(virCommand *cmd,
if (subsys->u.pci.driver.iommufd && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); if (qemuMonitorCreateObjectProps(&props, "iommufd", iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd), NULL) < 0) return -1;
@@ -5245,7 +5269,18 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1;
- if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev))) + if (subsys->u.pci.driver.iommufd) { + virDomainHostdevSubsysPCI *pcisrc = &hostdev->source.subsys.u.pci; + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); + + int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + + virCommandPassFD(cmd, vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT);
This would become just: qemuDomainHostdevPrivate *priv = (qemuDomainHostdevPrivate *)vsock->privateData; virCommandPassFD(cmd, priv->vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT);
+ } + + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev, vm))) return -1;
if (qemuBuildDeviceCommandlineFromJSON(cmd, devprops, def, qemuCaps) < 0) @@ -10893,7 +10928,7 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildRedirdevCommandLine(cmd, def, qemuCaps) < 0) return NULL;
- if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) + if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps, vm) < 0) return NULL;
if (migrateURI) diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h index ad068f1f16..380aac261f 100644 --- a/src/qemu/qemu_command.h +++ b/src/qemu/qemu_command.h @@ -180,7 +180,8 @@ qemuBuildThreadContextProps(virJSONValue **tcProps, /* Current, best practice */ virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev); + virDomainHostdevDef *dev, + virDomainObj *vm);
virJSONValue * qemuBuildRNGDevProps(const virDomainDef *def, diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index a42721efad..86640aa3e3 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1953,6 +1953,11 @@ qemuDomainObjPrivateFree(void *data)
virChrdevFree(priv->devs);
+ if (priv->iommufd >= 0) { + virEventRemoveHandle(priv->iommufd);
There is no handle to remove (and none is needed). So no need for the condition either.
+ priv->iommufd = -1; + } + if (priv->pidMonitored >= 0) { virEventRemoveHandle(priv->pidMonitored); priv->pidMonitored = -1; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 45fc32a663..cecfed94a7 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -25,6 +25,7 @@ #include <unistd.h> #include <signal.h> #include <sys/stat.h> +#include <dirent.h>
We should not need this in qemu_process.
#if WITH_SYS_SYSCALL_H # include <sys/syscall.h> #endif @@ -8091,6 +8092,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuExtDevicesStart(driver, vm, incomingMigrationExtDevices) < 0) goto cleanup;
+ if (qemuProcessOpenVfioFds(vm) < 0) + goto cleanup; + if (!(cmd = qemuBuildCommandLine(vm, incoming ? "defer" : NULL, vmop, @@ -10267,3 +10271,231 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, qemuProcessEventSubmit(vm, QEMU_PROCESS_EVENT_NBDKIT_EXITED, 0, 0, nbdkit); virObjectUnlock(vm); } + +/** + * qemuProcessOpenIommuFd: + * @vm: domain object + * @iommuFd: returned file descriptor + * + * Opens /dev/iommu file descriptor for the VM. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenIommuFd(virDomainObj *vm, int *iommuFd) +{ + int fd = -1; + + VIR_DEBUG("Opening IOMMU FD for domain %s", vm->def->name); + + if ((fd = open("/dev/iommu", O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU FD support requires /dev/iommu device")); + } else { + virReportSystemError(errno, "%s", + _("cannot open /dev/iommu")); + } + return -1; + } + + *iommuFd = fd; + VIR_DEBUG("Opened IOMMU FD %d for domain %s", fd, vm->def->name); + return 0; +} + +/** + * qemuProcessGetVfioDevicePath: + * @hostdev: host device definition + * @vfioPath: returned VFIO device path + * + * Constructs the VFIO device path for a PCI hostdev. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessGetVfioDevicePath(virDomainHostdevDef *hostdev,
No need to pass the whole hostdev here. Then this function can live in virpci.c
+ char **vfioPath) +{ + virPCIDeviceAddress *addr; + g_autofree char *sysfsPath = NULL; + DIR *dir = NULL; + struct dirent *entry = NULL; + int ret = -1; +
Jano
On 11/6/2025 11:19 AM, Ján Tomko wrote:
Open iommufd FDs from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu and one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line.
The part formatting the object and the part formatting the device should be split.
Sounds good, I will split it into two commits.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 43 +++++++- src/qemu/qemu_command.h | 3 +- src/qemu/qemu_domain.c | 8 ++ src/qemu/qemu_domain.h | 7 ++ src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 232 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 289 insertions(+), 6 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 8fd7527645..740a6970f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4730,7 +4730,8 @@ qemuBuildVideoCommandLine(virCommand *cmd,
virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev) + virDomainHostdevDef *dev, + virDomainObj *vm)
Hmm, perhaps exposing the iommufd object in the XML would save us from having to pass this.
We are passing virDomainObj to this function in order to retrieve the FD number. Would you mind clarifying how we could avoid passing it by exposing the iommufd object in the XML? It is my understanding that exposing the iommufd object ID would still mean we need to pass the virDomainObj.
{ g_autoptr(virJSONValue) props = NULL; virDomainHostdevSubsysPCI *pcisrc = &dev->source.subsys.u.pci; @@ -4741,6 +4742,13 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + qemuDomainObjPrivate *priv = vm ? vm->privateData : NULL; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + pcisrc->driver.iommufd) { + useIommufd = true; + }
/* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4787,6 +4795,18 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL;
+ if (useIommufd && priv) { + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); +
There's no need to duplicate the list of hostdevs which use iommufd in a per-domain hash table.
For storing per-device file descriptors, we have per-device private data.
+ int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + if (virJSONValueObjectAdd(&props, + "S:fd", g_strdup_printf("%d", vfiofd), + NULL) < 0) + return NULL; + } + if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0) return NULL;
@@ -5197,12 +5217,14 @@ qemuBuildAcpiNodesetProps(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; const char * iommufdId = "iommufd0"; + qemuDomainObjPrivate *priv = vm->privateData;
for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5233,8 +5255,10 @@ qemuBuildHostdevCommandLine(virCommand *cmd,
if (subsys->u.pci.driver.iommufd && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); if (qemuMonitorCreateObjectProps(&props, "iommufd", iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd), NULL) < 0) return -1;
@@ -5245,7 +5269,18 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1;
- if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev))) + if (subsys->u.pci.driver.iommufd) { + virDomainHostdevSubsysPCI *pcisrc = &hostdev->source.subsys.u.pci; + g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d", + pcisrc->addr.domain, pcisrc->addr.bus, + pcisrc->addr.slot, pcisrc->addr.function); + + int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName)); + + virCommandPassFD(cmd, vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT);
This would become just:
qemuDomainHostdevPrivate *priv = (qemuDomainHostdevPrivate *)vsock->privateData;
virCommandPassFD(cmd, priv->vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT);
I will proceed with implementing a qemuDomainHostdevPrivate struct and look into the existing implementation for qemuDomainDiskPrivate for reference. I was not able to find a private data attribute in the _virDomainHostdevDef struct definition, but I do see "virObject *privateData;" under the _virDomainDiskDef struct definition - are you aware of any reason behind this, or has it just never been needed for the hostdev struct?
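The per-device private data idea discussed above can be sketched in plain C. This is only an illustration under stated assumptions, not libvirt code: `hostdevPrivate`, `hostdevDef`, and the helper names are hypothetical, and the real `qemuDomainHostdevPrivate` would presumably be a virObject class modelled on `qemuDomainDiskPrivate`. Only the `vfiofd` field name is taken from the review comment.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-hostdev private data: one VFIO cdev FD per device. */
typedef struct {
    int vfiofd; /* -1 while no FD has been opened */
} hostdevPrivate;

/* Hypothetical, heavily reduced hostdev definition with a private-data slot. */
typedef struct {
    int iommufd;                 /* non-zero if <driver iommufd='yes'/> */
    hostdevPrivate *privateData; /* owned by the driver, not by the XML */
} hostdevDef;

/* Allocate private data with the FD marked as "not open". */
static hostdevPrivate *
hostdevPrivateNew(void)
{
    hostdevPrivate *priv = calloc(1, sizeof(*priv));

    if (priv)
        priv->vfiofd = -1;
    return priv;
}

/* Release the private data; closing the FD is left to its owner. */
static void
hostdevPrivateFree(hostdevPrivate *priv)
{
    free(priv);
}

/* Return the FD to hand to QEMU, or -1 if this device has none. */
static int
hostdevGetVfioFd(const hostdevDef *dev)
{
    assert(dev);

    if (!dev->iommufd || !dev->privateData)
        return -1;
    return dev->privateData->vfiofd;
}
```

With this layout the command-line builder reads the FD straight from the device it is already iterating over, so no address-keyed name has to be rebuilt for a lookup. It also gives "no FD" an explicit value of -1, whereas `GPOINTER_TO_INT()` applied to a failed `g_hash_table_lookup()` yields 0, which is a valid descriptor number.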
+ } + + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev, vm))) return -1;
if (qemuBuildDeviceCommandlineFromJSON(cmd, devprops, def, qemuCaps) < 0) @@ -10893,7 +10928,7 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildRedirdevCommandLine(cmd, def, qemuCaps) < 0) return NULL;
- if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) + if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps, vm) < 0) return NULL;
if (migrateURI) diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h index ad068f1f16..380aac261f 100644 --- a/src/qemu/qemu_command.h +++ b/src/qemu/qemu_command.h @@ -180,7 +180,8 @@ qemuBuildThreadContextProps(virJSONValue **tcProps, /* Current, best practice */ virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev); + virDomainHostdevDef *dev, + virDomainObj *vm);
virJSONValue * qemuBuildRNGDevProps(const virDomainDef *def, diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index a42721efad..86640aa3e3 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1953,6 +1953,11 @@ qemuDomainObjPrivateFree(void *data)
virChrdevFree(priv->devs);
+ if (priv->iommufd >= 0) { + virEventRemoveHandle(priv->iommufd);
There is no handle to remove (and none is needed). So no need for the condition either.
Thanks for catching this. Agreed that the other logic to close the iommufd file descriptor is sufficient and this is not needed.
+ priv->iommufd = -1; + } + if (priv->pidMonitored >= 0) { virEventRemoveHandle(priv->pidMonitored); priv->pidMonitored = -1; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 45fc32a663..cecfed94a7 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -25,6 +25,7 @@ #include <unistd.h> #include <signal.h> #include <sys/stat.h> +#include <dirent.h>
We should not need this in qemu_process.
This must have been left over from a previous implementation attempt. I will remove this in the next revision. Thanks
#if WITH_SYS_SYSCALL_H # include <sys/syscall.h> #endif @@ -8091,6 +8092,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuExtDevicesStart(driver, vm, incomingMigrationExtDevices) < 0) goto cleanup;
+ if (qemuProcessOpenVfioFds(vm) < 0) + goto cleanup; + if (!(cmd = qemuBuildCommandLine(vm, incoming ? "defer" : NULL, vmop, @@ -10267,3 +10271,231 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, qemuProcessEventSubmit(vm, QEMU_PROCESS_EVENT_NBDKIT_EXITED, 0, 0, nbdkit); virObjectUnlock(vm); } + +/** + * qemuProcessOpenIommuFd: + * @vm: domain object + * @iommuFd: returned file descriptor + * + * Opens /dev/iommu file descriptor for the VM. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenIommuFd(virDomainObj *vm, int *iommuFd) +{ + int fd = -1; + + VIR_DEBUG("Opening IOMMU FD for domain %s", vm->def->name); + + if ((fd = open("/dev/iommu", O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU FD support requires /dev/iommu device")); + } else { + virReportSystemError(errno, "%s", + _("cannot open /dev/iommu")); + } + return -1; + } + + *iommuFd = fd; + VIR_DEBUG("Opened IOMMU FD %d for domain %s", fd, vm->def->name); + return 0; +} + +/** + * qemuProcessGetVfioDevicePath: + * @hostdev: host device definition + * @vfioPath: returned VFIO device path + * + * Constructs the VFIO device path for a PCI hostdev. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessGetVfioDevicePath(virDomainHostdevDef *hostdev,
No need to pass the whole hostdev here. Then this function can live in virpci.c
Sounds good, I will move this to virpci.c and just pass in the device address. -Nathan
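A minimal sketch of what an address-only lookup might look like, assuming plain POSIX directory calls instead of libvirt's `virDirOpen()`/`virDirRead()`. The `pciAddress` type, the function name `pciAddressGetVfioDevPath`, and the `sysfsRoot` parameter are hypothetical; the root is a parameter only so the lookup can be exercised against a fake tree, and would normally be `/sys/bus/pci/devices`.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for virPCIDeviceAddress; only the fields used here. */
typedef struct {
    unsigned int domain;
    unsigned int bus;
    unsigned int slot;
    unsigned int function;
} pciAddress;

/*
 * Find the VFIO cdev node for a PCI address by scanning
 * <sysfsRoot>/DDDD:BB:SS.F/vfio-dev/ for an entry starting with "vfio".
 * Returns a malloc'd "/dev/vfio/devices/vfioN" string, or NULL if the
 * directory cannot be opened or holds no such entry.
 */
static char *
pciAddressGetVfioDevPath(const char *sysfsRoot, const pciAddress *addr)
{
    char dirPath[512];
    char *ret = NULL;
    DIR *dir;
    struct dirent *entry;

    assert(sysfsRoot && addr);

    snprintf(dirPath, sizeof(dirPath), "%s/%04x:%02x:%02x.%u/vfio-dev",
             sysfsRoot, addr->domain, addr->bus, addr->slot, addr->function);

    if (!(dir = opendir(dirPath)))
        return NULL;

    while ((entry = readdir(dir))) {
        if (strncmp(entry->d_name, "vfio", 4) == 0) {
            size_t len = strlen("/dev/vfio/devices/") + strlen(entry->d_name) + 1;

            if ((ret = malloc(len)))
                snprintf(ret, len, "/dev/vfio/devices/%s", entry->d_name);
            break;
        }
    }

    closedir(dir);
    return ret;
}
```

Because the function needs nothing beyond the address, it has no dependency on the hostdev definition or on the qemu driver, which is what would let it sit next to the other PCI helpers.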
On a Thursday in 2025, Nathan Chen wrote:
On 11/6/2025 11:19 AM, Ján Tomko wrote:
Open iommufd FDs from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu and one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line.
The part formatting the object and the part formatting the device should be split.
Sounds good, I will split it into two commits.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 43 +++++++- src/qemu/qemu_command.h | 3 +- src/qemu/qemu_domain.c | 8 ++ src/qemu/qemu_domain.h | 7 ++ src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 232 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 289 insertions(+), 6 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 8fd7527645..740a6970f2 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4730,7 +4730,8 @@ qemuBuildVideoCommandLine(virCommand *cmd,
virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev) + virDomainHostdevDef *dev, + virDomainObj *vm)
Hmm, perhaps exposing the iommufd object in the XML would save us from having to pass this.
We are passing virDomainObj to this function in order to retrieve the FD number. Would you mind clarifying how we could avoid passing it by exposing the iommufd object in the XML? It is my understanding that exposing the iommufd object ID would still mean we need to pass the virDomainObj.
If it was a separate device, it would have its own data type and own formatting function unrelated to hostdevs.
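One way to picture that suggestion is an iommufd object with its own data type and its own formatter, independent of the hostdev code path. The sketch below is an assumption-laden illustration in plain C: `iommufdObject` and `iommufdObjectFormatJSON` are hypothetical names, the string is built with `snprintf()` rather than libvirt's virJSONValue helpers, and the id is assumed to need no JSON escaping. The output shape follows the `-object` argument shown in the cover letter.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical standalone representation of the iommufd object. */
typedef struct {
    const char *id; /* e.g. "iommufd0" */
    int fd;         /* FD number as inherited by QEMU */
} iommufdObject;

/*
 * Format the -object JSON for an iommufd object into buf.
 * Returns 0 on success, -1 if the result did not fit or formatting failed.
 */
static int
iommufdObjectFormatJSON(const iommufdObject *obj, char *buf, size_t buflen)
{
    int n;

    assert(obj && obj->id && buf);

    n = snprintf(buf, buflen,
                 "{\"qom-type\":\"iommufd\",\"id\":\"%s\",\"fd\":\"%d\"}",
                 obj->id, obj->fd);

    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return 0;
}
```

With the object formatted from its own data, the hostdev formatter would only need the object's id, and the FD for `/dev/iommu` would not have to travel through the hostdev functions.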
{ g_autoptr(virJSONValue) props = NULL; virDomainHostdevSubsysPCI *pcisrc = &dev->source.subsys.u.pci; @@ -4741,6 +4742,13 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + qemuDomainObjPrivate *priv = vm ? vm->privateData : NULL; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + pcisrc->driver.iommufd) { + useIommufd = true; + }
/* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4787,6 +4795,18 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL;
+    if (useIommufd && priv) {
+        g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d",
+                                                      pcisrc->addr.domain, pcisrc->addr.bus,
+                                                      pcisrc->addr.slot, pcisrc->addr.function);
+
+        int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName));
There's no need to duplicate the list of hostdevs which use iommufd in a per-domain hash table.
For storing per-device file descriptors, we have per-device private data.
+        if (virJSONValueObjectAdd(&props,
+                                  "S:fd", g_strdup_printf("%d", vfiofd),
+                                  NULL) < 0)
+            return NULL;
+    }
+
     if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0)
         return NULL;
@@ -5197,12 +5217,14 @@ qemuBuildAcpiNodesetProps(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; const char * iommufdId = "iommufd0"; + qemuDomainObjPrivate *priv = vm->privateData;
for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5233,8 +5255,10 @@ qemuBuildHostdevCommandLine(virCommand *cmd,
if (subsys->u.pci.driver.iommufd && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); if (qemuMonitorCreateObjectProps(&props, "iommufd", iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd), NULL) < 0) return -1;
@@ -5245,7 +5269,18 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1;
-        if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev)))
+        if (subsys->u.pci.driver.iommufd) {
+            virDomainHostdevSubsysPCI *pcisrc = &hostdev->source.subsys.u.pci;
+            g_autofree char *vfioFdName = g_strdup_printf("vfio-%04x:%02x:%02x.%d",
+                                                          pcisrc->addr.domain, pcisrc->addr.bus,
+                                                          pcisrc->addr.slot, pcisrc->addr.function);
+
+            int vfiofd = GPOINTER_TO_INT(g_hash_table_lookup(priv->vfioDeviceFds, vfioFdName));
+
+            virCommandPassFD(cmd, vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT);
This would become just:
    qemuDomainHostdevPrivate *priv = (qemuDomainHostdevPrivate *)vsock->privateData;
    virCommandPassFD(cmd, priv->vfiofd, VIR_COMMAND_PASS_FD_CLOSE_PARENT);
I will proceed with implementing a qemuDomainHostdevPrivate struct and look into the existing implementation for qemuDomainDiskPrivate for reference. I was not able to find a private data attribute in the _virDomainHostdevDef struct definition, but I do see "virObject *privateData;" under the _virDomainDiskDef struct definition - are you aware of any reason behind this, or has it just never been needed for the hostdev struct?
It was added for disks at the time when it was needed. Jano
+ } + + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev, vm))) return -1;
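To illustrate the requested split between formatting the iommufd object and formatting the vfio-pci device, here is a minimal standalone sketch in plain C. It is not libvirt code: the helper names formatIommufdObject() and formatVfioPciDevice() are hypothetical, and snprintf() stands in for the virJSONValue helpers the patch actually uses. The JSON shapes follow the command line shown in the cover letter, with the bus and addr properties left out for brevity.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the per-domain iommufd object.
 * Returns 0 on success, -1 on a missing ID or a too-small buffer. */
int
formatIommufdObject(char *buf, size_t buflen, const char *id, int fd)
{
    int n;

    if (!id || strlen(id) == 0)
        return -1;

    n = snprintf(buf, buflen,
                 "{\"qom-type\":\"iommufd\",\"id\":\"%s\",\"fd\":\"%d\"}",
                 id, fd);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Hypothetical helper: format one vfio-pci device. It refers to the
 * iommufd object only by ID and carries its own VFIO cdev FD, so it
 * does not need to know how the object itself was formatted. */
int
formatVfioPciDevice(char *buf, size_t buflen, const char *host,
                    const char *id, const char *iommufdId, int fd)
{
    int n;

    if (!host || !id || !iommufdId)
        return -1;

    n = snprintf(buf, buflen,
                 "{\"driver\":\"vfio-pci\",\"host\":\"%s\",\"id\":\"%s\","
                 "\"iommufd\":\"%s\",\"fd\":\"%d\"}",
                 host, id, iommufdId, fd);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

Because the two helpers share nothing but the object ID, each could live in its own commit, which is the shape the review asks for.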
On 11/7/2025 4:40 AM, Ján Tomko wrote:
On a Thursday in 2025, Nathan Chen wrote:
On 11/6/2025 11:19 AM, Ján Tomko wrote:
virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, - virDomainHostdevDef *dev) + virDomainHostdevDef *dev, + virDomainObj *vm)
Hmm, perhaps exposing the iommufd object in the XML would save us from having to pass this.
We are passing virDomainObj to this function in order to retrieve the FD number. Would you mind clarifying how exposing the iommufd object in the XML would let us avoid passing it? It is my understanding that exposing the iommufd object ID would still mean we need to pass the virDomainObj.
If it was a separate device, it would have its own data type and own formatting function unrelated to hostdevs.
That makes sense, like passing a new virDomainIommufdDef struct pointer to a new qemuBuildIommufdDevProps() function. Previously we implemented a virDomainIommufdDef struct as a member under virDomainIOMMUDef [0] to store the iommufd ID and FD number, but we changed the implementation and XML representation to be only a bool member associated with hostdevs. What are your thoughts on either of the following paths forward?

1. Re-implementing the virDomainIommufdDef struct as a virDomainHostdevDef member, storing the FD numbers as part of a private data struct member in virDomainIommufdDef.
2. Storing the FD numbers as part of a private data struct member in virDomainHostdevDef, and avoiding implementing a separate virDomainIommufdDef struct.

Option 2 seems the most straightforward to me, still allowing us to avoid passing the virDomainObj to qemuBuildPCIHostdevDevProps(), but please let me know what you think.

[0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/EASBQ...

Thanks,
Nathan
On a Friday in 2025, Nathan Chen wrote:
On 11/7/2025 4:40 AM, Ján Tomko wrote:
On a Thursday in 2025, Nathan Chen wrote:
On 11/6/2025 11:19 AM, Ján Tomko wrote:
That makes sense, like passing a new virDomainIommufdDef struct pointer to a new qemuBuildIommufdDevProps() function. Previously we implemented a virDomainIommufdDef struct as a member under virDomainIOMMUDef [0] to store the iommufd ID and FD number, but we changed the implementation and XML representation to be only a bool member associated with hostdevs. What are your thoughts on either of the following paths forward?
1. Re-implementing the virDomainIommufdDef struct as a virDomainHostdevDef member, storing the FD numbers as part of a private data struct member in virDomainIommufdDef.
2. Storing the FD numbers as part of a private data struct member in virDomainHostdevDef, and avoiding implementing a separate virDomainIommufdDef struct.
Option 2 seems the most straightforward to me, still allowing us to avoid passing the virDomainObj to qemuBuildPCIHostdevDevProps(), but please let me know what you think.
Oops, I missed this question. Option 2 sounds better to me. Jano
[0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/EASBQ...
Thanks, Nathan
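As a rough illustration of Option 2, the sketch below keeps the VFIO FD in per-device private data, so the formatting helper needs only the hostdev and no domain object. The types Hostdev and HostdevPrivate and the function hostdevFormatFdProp() are hypothetical stand-ins written for this example; they are not the libvirt structs, and the real private data would be a virObject subclass rather than a plain struct.

```c
#include <stdio.h>

/* Hypothetical stand-ins for libvirt's types; only the shape matters. */
typedef struct {
    int vfiofd;                    /* FD for /dev/vfio/devices/vfioX, -1 if unset */
} HostdevPrivate;

typedef struct {
    int iommufd;                   /* non-zero when <driver iommufd='yes'/> is set */
    HostdevPrivate *privateData;   /* per-device runtime state */
} Hostdev;

/* Format the "fd" property value for one hostdev using only the device.
 * Returns 1 if a value was written, 0 if the device does not use iommufd,
 * -1 on error (no FD stored, or buf too small). */
int
hostdevFormatFdProp(const Hostdev *dev, char *buf, size_t buflen)
{
    int n;

    if (!dev->iommufd)
        return 0;

    if (!dev->privateData || dev->privateData->vfiofd < 0)
        return -1;

    n = snprintf(buf, buflen, "%d", dev->privateData->vfiofd);
    if (n < 0 || (size_t)n >= buflen)
        return -1;

    return 1;
}
```

Compared with the per-domain hash table keyed by a formatted PCI address, there is no key to build and no second list of iommufd hostdevs to keep in sync.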
Allow access to /dev/iommu and /dev/vfio/devices/vfio* when launching a qemu VM with iommufd feature enabled. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_cgroup.c | 61 ++++++++++++++++++++++++++++ src/qemu/qemu_cgroup.h | 1 + src/qemu/qemu_namespace.c | 44 +++++++++++++++++++++ src/security/security_apparmor.c | 15 +++++++ src/security/security_dac.c | 34 ++++++++++++++++ src/security/security_selinux.c | 34 ++++++++++++++++ src/security/virt-aa-helper.c | 11 +++++- src/util/virpci.c | 68 ++++++++++++++++++++++++++++++++ src/util/virpci.h | 1 + 9 files changed, 268 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 46a7dc1d8b..e15ffd2007 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -461,6 +461,54 @@ qemuTeardownInputCgroup(virDomainObj *vm, } +int +qemuSetupIommufdCgroup(virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + g_autofree char *path = NULL; + int iommufd = 0; + size_t i; + + for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } + + if (iommufd == 1) { + if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) + return 0; + if (virDirOpen(&dir, "/dev/vfio/devices") < 0) { + if (errno == ENOENT) + return 0; + return -1; + } + while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) { + if (STRPREFIX(dent->d_name, "vfio")) { + path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + } + if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } + path = NULL; + } + if (virFileExists("/dev/iommu")) + path = g_strdup("/dev/iommu"); + if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } + } + return 0; +} + + /** * qemuSetupHostdevCgroup: * vm: domain object @@ -759,6 +807,7 @@ 
qemuSetupDevicesCgroup(virDomainObj *vm) g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); const char *const *deviceACL = (const char *const *) cfg->cgroupDeviceACL; int rv = -1; + int iommufd = 0; size_t i; if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) @@ -836,6 +885,18 @@ qemuSetupDevicesCgroup(virDomainObj *vm) return -1; } + for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } + + if (iommufd == 1) { + if (qemuSetupIommufdCgroup(vm) < 0) + return -1; + } + for (i = 0; i < vm->def->nmems; i++) { if (qemuSetupMemoryDevicesCgroup(vm, vm->def->mems[i]) < 0) return -1; diff --git a/src/qemu/qemu_cgroup.h b/src/qemu/qemu_cgroup.h index 3668034cde..bea677ba3c 100644 --- a/src/qemu/qemu_cgroup.h +++ b/src/qemu/qemu_cgroup.h @@ -42,6 +42,7 @@ int qemuSetupHostdevCgroup(virDomainObj *vm, int qemuTeardownHostdevCgroup(virDomainObj *vm, virDomainHostdevDef *dev) G_GNUC_WARN_UNUSED_RESULT; +int qemuSetupIommufdCgroup(virDomainObj *vm); int qemuSetupMemoryDevicesCgroup(virDomainObj *vm, virDomainMemoryDef *mem); int qemuTeardownMemoryDevicesCgroup(virDomainObj *vm, diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 932777505b..80496f2f0f 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -683,6 +683,47 @@ qemuDomainSetupLaunchSecurity(virDomainObj *vm, } +static int +qemuDomainSetupIommufd(virDomainObj *vm, + GSList **paths) +{ + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + g_autofree char *path = NULL; + int iommufd = 0; + size_t i; + + for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } + + /* Check if iommufd is enabled */ + if (iommufd == 1) { + if (virDirOpen(&dir, "/dev/vfio/devices") < 0) { + if (errno == ENOENT) + return 0; + return -1; + } + while (virDirRead(dir, &dent, 
"/dev/vfio/devices") > 0) { + if (STRPREFIX(dent->d_name, "vfio")) { + path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); + } + } + path = NULL; + if (virFileExists("/dev/iommu")) + path = g_strdup("/dev/iommu"); + if (path) + *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); + } + + return 0; +} + + static int qemuNamespaceMknodPaths(virDomainObj *vm, GSList *paths, @@ -706,6 +747,9 @@ qemuDomainBuildNamespace(virQEMUDriverConfig *cfg, if (qemuDomainSetupAllDisks(vm, &paths) < 0) return -1; + if (qemuDomainSetupIommufd(vm, &paths) < 0) + return -1; + if (qemuDomainSetupAllHostdevs(vm, &paths) < 0) return -1; diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c index 68ac39611f..0a878fd205 100644 --- a/src/security/security_apparmor.c +++ b/src/security/security_apparmor.c @@ -856,6 +856,21 @@ AppArmorSetSecurityHostdevLabel(virSecurityManager *mgr, } ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr); VIR_FREE(vfioGroupDev); + + if (dev->source.subsys.u.pci.driver.iommufd) { + g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci); + const char *iommufdDir = "/dev/iommu"; + if (vfiofdDev) { + int ret2 = AppArmorSetSecurityPCILabel(pci, vfiofdDev, ptr); + if (ret2 < 0) + ret = ret2; + ret2 = AppArmorSetSecurityPCILabel(pci, iommufdDir, ptr); + if (ret2 < 0) + ret = ret2; + } else { + return -1; + } + } } else { ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr); } diff --git a/src/security/security_dac.c b/src/security/security_dac.c index 2f788b872a..361106222d 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -1290,6 +1290,24 @@ virSecurityDACSetHostdevLabel(virSecurityManager *mgr, ret = virSecurityDACSetHostdevLabelHelper(vfioGroupDev, false, &cbdata); + if (dev->source.subsys.u.pci.driver.iommufd) { + g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci); + const char *iommufdDir = 
"/dev/iommu"; + if (vfiofdDev) { + int ret2 = virSecurityDACSetHostdevLabelHelper(vfiofdDev, + false, + &cbdata); + if (ret2 < 0) + ret = ret2; + ret2 = virSecurityDACSetHostdevLabelHelper(iommufdDir, + false, + &cbdata); + if (ret2 < 0) + ret = ret2; + } else { + return -1; + } + } } else { ret = virPCIDeviceFileIterate(pci, virSecurityDACSetPCILabel, @@ -1450,6 +1468,22 @@ virSecurityDACRestoreHostdevLabel(virSecurityManager *mgr, ret = virSecurityDACRestoreFileLabelInternal(mgr, NULL, vfioGroupDev, false); + if (dev->source.subsys.u.pci.driver.iommufd) { + g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci); + const char *iommufdDir = "/dev/iommu"; + if (vfiofdDev) { + int ret2 = virSecurityDACRestoreFileLabelInternal(mgr, NULL, + vfiofdDev, false); + if (ret2 < 0) + ret = ret2; + ret2 = virSecurityDACRestoreFileLabelInternal(mgr, NULL, + iommufdDir, false); + if (ret2 < 0) + ret = ret2; + } else { + return -1; + } + } } else { ret = virPCIDeviceFileIterate(pci, virSecurityDACRestorePCILabel, mgr); } diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index fa5d1568eb..fbe8f63ab4 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -2248,6 +2248,25 @@ virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManager *mgr, ret = virSecuritySELinuxSetHostdevLabelHelper(vfioGroupDev, false, &data); + if (dev->source.subsys.u.pci.driver.iommufd) { + g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci); + const char *iommufdDir = "/dev/iommu"; + if (vfiofdDev) { + int ret2 = virSecuritySELinuxSetHostdevLabelHelper(vfiofdDev, + false, + &data); + if (ret2 < 0) + ret = ret2; + ret2 = virSecuritySELinuxSetHostdevLabelHelper(iommufdDir, + false, + &data); + if (ret2 < 0) + ret = ret2; + } else { + return -1; + } + } + } else { ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxSetPCILabel, &data); } @@ -2481,6 +2500,21 @@ virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManager *mgr, return -1; 
ret = virSecuritySELinuxRestoreFileLabel(mgr, vfioGroupDev, false); + + if (dev->source.subsys.u.pci.driver.iommufd) { + g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci); + const char *iommufdDir = "/dev/iommu"; + if (vfiofdDev) { + int ret2 = virSecuritySELinuxRestoreFileLabel(mgr, vfiofdDev, false); + if (ret2 < 0) + ret = ret2; + ret2 = virSecuritySELinuxRestoreFileLabel(mgr, iommufdDir, false); + if (ret2 < 0) + ret = ret2; + } else { + return -1; + } + } } else { ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxRestorePCILabel, mgr); } diff --git a/src/security/virt-aa-helper.c b/src/security/virt-aa-helper.c index de0a826063..c9e6d9c6a9 100644 --- a/src/security/virt-aa-helper.c +++ b/src/security/virt-aa-helper.c @@ -878,7 +878,7 @@ get_files(vahControl * ctl) size_t i; g_autofree char *uuid = NULL; char uuidstr[VIR_UUID_STRING_BUFLEN]; - bool needsVfio = false, needsvhost = false, needsgl = false; + bool needsVfio = false, needsvhost = false, needsgl = false, needsIommufd = false; /* verify uuid is same as what we were given on the command line */ virUUIDFormat(ctl->def->uuid, uuidstr); @@ -1119,6 +1119,9 @@ get_files(vahControl * ctl) needsVfio = true; } + if (dev->source.subsys.u.pci.driver.iommufd) + needsIommufd = true; + if (pci == NULL) continue; @@ -1348,6 +1351,12 @@ get_files(vahControl * ctl) virBufferAddLit(&buf, " \"/dev/vfio/vfio\" rw,\n"); virBufferAddLit(&buf, " \"/dev/vfio/[0-9]*\" rw,\n"); } + + if (needsIommufd) { + virBufferAddLit(&buf, " \"/dev/iommu\" rwm,\n"); + virBufferAddLit(&buf, " \"/dev/vfio/devices/vfio[0-9]*\" rwm,\n"); + } + if (needsgl) { /* if using gl all sorts of further dri related paths will be needed */ virBufferAddLit(&buf, " # DRI/Mesa/(e)GL config and driver paths\n"); diff --git a/src/util/virpci.c b/src/util/virpci.c index 90617e69c6..6e6e5e47c0 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2478,6 +2478,74 @@ virPCIDeviceGetIOMMUGroupDev(virPCIDevice *dev) return 
g_strdup_printf("/dev/vfio/%s", groupFile); } +/* virPCIDeviceGetIOMMUFDDev - return the name of the device used + * to control this PCI device's group (e.g. "/dev/vfio/devices/vfio15") + */ +char * +virPCIDeviceGetIOMMUFDDev(virPCIDevice *dev) +{ + g_autofree char *path = NULL; + const char *pci_addr = NULL; + g_autoptr(DIR) dir = NULL; + struct dirent *entry; + char *vfiodev = NULL; + + /* Get PCI device address */ + pci_addr = virPCIDeviceGetName(dev); + if (!pci_addr) + return NULL; + + /* First try: look in PCI device's vfio-dev subdirectory */ + path = g_strdup_printf("/sys/bus/pci/devices/%s/vfio-dev", pci_addr); + + if (virDirOpen(&dir, path) == 1) { + while (virDirRead(dir, &entry, path) > 0) { + if (!g_str_has_prefix(entry->d_name, "vfio")) + continue; + + vfiodev = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name); + break; + } + /* g_autoptr will automatically close dir when it goes out of scope */ + dir = NULL; + } + + /* Second try: scan /sys/class/vfio-dev for matching device */ + if (!vfiodev) { + g_free(path); + path = g_strdup("/sys/class/vfio-dev"); + + if (virDirOpen(&dir, path) == 1) { + while (virDirRead(dir, &entry, path) > 0) { + g_autofree char *dev_link = NULL; + g_autofree char *target = NULL; + + if (!g_str_has_prefix(entry->d_name, "vfio")) + continue; + + dev_link = g_strdup_printf("/sys/class/vfio-dev/%s/device", entry->d_name); + + if (virFileResolveLink(dev_link, &target) < 0) + continue; + + if (strstr(target, pci_addr)) { + vfiodev = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name); + break; + } + } + /* g_autoptr will automatically close dir */ + } + } + + /* Verify the device path exists and is accessible */ + if (vfiodev && !virFileExists(vfiodev)) { + VIR_FREE(vfiodev); + return NULL; + } + + return vfiodev; +} + static int virPCIDeviceDownstreamLacksACS(virPCIDevice *dev) { diff --git a/src/util/virpci.h b/src/util/virpci.h index fc538566e1..996ffab2f9 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ 
-203,6 +203,7 @@ int virPCIDeviceAddressGetIOMMUGroupNum(virPCIDeviceAddress *addr); char *virPCIDeviceAddressGetIOMMUGroupDev(const virPCIDeviceAddress *devAddr); bool virPCIDeviceExists(const virPCIDeviceAddress *addr); char *virPCIDeviceGetIOMMUGroupDev(virPCIDevice *dev); +char *virPCIDeviceGetIOMMUFDDev(virPCIDevice *dev); int virPCIDeviceIsAssignable(virPCIDevice *dev, int strict_acs_check); -- 2.43.0
On a Monday in 2025, Nathan Chen via Devel wrote:
Allow access to /dev/iommu and /dev/vfio/devices/vfio* when launching a qemu VM with iommufd feature enabled.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_cgroup.c | 61 ++++++++++++++++++++++++++++ src/qemu/qemu_cgroup.h | 1 + src/qemu/qemu_namespace.c | 44 +++++++++++++++++++++ src/security/security_apparmor.c | 15 +++++++ src/security/security_dac.c | 34 ++++++++++++++++ src/security/security_selinux.c | 34 ++++++++++++++++ src/security/virt-aa-helper.c | 11 +++++- src/util/virpci.c | 68 ++++++++++++++++++++++++++++++++ src/util/virpci.h | 1 + 9 files changed, 268 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 46a7dc1d8b..e15ffd2007 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -461,6 +461,54 @@ qemuTeardownInputCgroup(virDomainObj *vm, }
+int +qemuSetupIommufdCgroup(virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + g_autofree char *path = NULL; + int iommufd = 0; + size_t i; + + for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } + + if (iommufd == 1) { + if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) + return 0; + if (virDirOpen(&dir, "/dev/vfio/devices") < 0) { + if (errno == ENOENT) + return 0; + return -1; + } + while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) { + if (STRPREFIX(dent->d_name, "vfio")) { + path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + } + if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1;
This allows all the devices instead of just the ones the VM needs. Also, this is still a hostdev, so it should be done inside qemuSetupHostdevCgroup. Do hostdevs using iommufd also need access to a) /dev/vfio/vfio and b) /dev/vfio/<iommugroup> which were already allowed in qemuSetupHostdevCgroup?
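A sketch of the narrower alternative, resolving only the VFIO cdev that belongs to one PCI device instead of walking all of /dev/vfio/devices, could look like the following. It uses plain POSIX calls rather than libvirt's virDirOpen()/virDirRead(), the function name findVfioCdev() is made up for this example, and the sysfs root is a parameter (normally "/sys/bus/pci/devices") purely so the code can be tried against a fake tree. The sysfs layout is the one the patch's virPCIDeviceGetIOMMUFDDev() already assumes.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Find the VFIO character device of a single PCI device by listing
 * <sysfsRoot>/<pciAddr>/vfio-dev and taking the first "vfio*" entry.
 * On success writes "/dev/vfio/devices/vfioN" to out and returns 0;
 * returns -1 if the directory is missing, empty, or out is too small. */
int
findVfioCdev(const char *sysfsRoot, const char *pciAddr,
             char *out, size_t outlen)
{
    char dirpath[512];
    struct dirent *ent;
    DIR *dir;
    int ret = -1;
    int n;

    n = snprintf(dirpath, sizeof(dirpath), "%s/%s/vfio-dev", sysfsRoot, pciAddr);
    if (n < 0 || (size_t)n >= sizeof(dirpath))
        return -1;

    if (!(dir = opendir(dirpath)))
        return -1;

    while ((ent = readdir(dir))) {
        /* "." and ".." are skipped by the prefix check as well */
        if (strncmp(ent->d_name, "vfio", 4) != 0)
            continue;

        n = snprintf(out, outlen, "/dev/vfio/devices/%s", ent->d_name);
        if (n >= 0 && (size_t)n < outlen)
            ret = 0;
        break;
    }

    closedir(dir);
    return ret;
}
```

Called once per hostdev, this yields at most one path per device, so only the nodes the VM actually uses would ever be allowed.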
+ } + path = NULL; + }
+ if (virFileExists("/dev/iommu")) + path = g_strdup("/dev/iommu");
No need to check for the existence of the device. If it does not exist, the VM won't start anyway. Also, is it necessary to allow these? libvirt already opened the files and passed file descriptors.
+ if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } + } + return 0; +} + + /** * qemuSetupHostdevCgroup: * vm: domain object @@ -759,6 +807,7 @@ qemuSetupDevicesCgroup(virDomainObj *vm) g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); const char *const *deviceACL = (const char *const *) cfg->cgroupDeviceACL; int rv = -1; + int iommufd = 0; size_t i;
if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) @@ -836,6 +885,18 @@ qemuSetupDevicesCgroup(virDomainObj *vm) return -1; }
+ for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } +
No need to check this upfront. If /dev/iommu access is necessary, the per-hostdev function qemuSetupHostdevCgroup can add it to the list multiple times, like it already does for /dev/vfio/vfio
+ if (iommufd == 1) { + if (qemuSetupIommufdCgroup(vm) < 0) + return -1; + } + for (i = 0; i < vm->def->nmems; i++) { if (qemuSetupMemoryDevicesCgroup(vm, vm->def->mems[i]) < 0) return -1; diff --git a/src/qemu/qemu_cgroup.h b/src/qemu/qemu_cgroup.h index 3668034cde..bea677ba3c 100644 --- a/src/qemu/qemu_cgroup.h +++ b/src/qemu/qemu_cgroup.h @@ -42,6 +42,7 @@ int qemuSetupHostdevCgroup(virDomainObj *vm, int qemuTeardownHostdevCgroup(virDomainObj *vm, virDomainHostdevDef *dev) G_GNUC_WARN_UNUSED_RESULT; +int qemuSetupIommufdCgroup(virDomainObj *vm); int qemuSetupMemoryDevicesCgroup(virDomainObj *vm, virDomainMemoryDef *mem); int qemuTeardownMemoryDevicesCgroup(virDomainObj *vm, diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 932777505b..80496f2f0f 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -683,6 +683,47 @@ qemuDomainSetupLaunchSecurity(virDomainObj *vm, }
+static int +qemuDomainSetupIommufd(virDomainObj *vm, + GSList **paths) +{ + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + g_autofree char *path = NULL; + int iommufd = 0; + size_t i; + + for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } + + /* Check if iommufd is enabled */ + if (iommufd == 1) { + if (virDirOpen(&dir, "/dev/vfio/devices") < 0) { + if (errno == ENOENT) + return 0; + return -1; + } + while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) { + if (STRPREFIX(dent->d_name, "vfio")) { + path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); + } + } + path = NULL; + if (virFileExists("/dev/iommu")) + path = g_strdup("/dev/iommu"); + if (path) + *paths = g_slist_prepend(*paths, g_steal_pointer(&path));
Same comments as for cgroups apply here too.
+ } + + return 0; +} + + static int qemuNamespaceMknodPaths(virDomainObj *vm, GSList *paths, @@ -706,6 +747,9 @@ qemuDomainBuildNamespace(virQEMUDriverConfig *cfg, if (qemuDomainSetupAllDisks(vm, &paths) < 0) return -1;
+ if (qemuDomainSetupIommufd(vm, &paths) < 0) + return -1; + if (qemuDomainSetupAllHostdevs(vm, &paths) < 0) return -1;
diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c index 68ac39611f..0a878fd205 100644 --- a/src/security/security_apparmor.c +++ b/src/security/security_apparmor.c @@ -856,6 +856,21 @@ AppArmorSetSecurityHostdevLabel(virSecurityManager *mgr, } ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr); VIR_FREE(vfioGroupDev); + + if (dev->source.subsys.u.pci.driver.iommufd) { + g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci); + const char *iommufdDir = "/dev/iommu"; + if (vfiofdDev) { + int ret2 = AppArmorSetSecurityPCILabel(pci, vfiofdDev, ptr); + if (ret2 < 0) + ret = ret2; + ret2 = AppArmorSetSecurityPCILabel(pci, iommufdDir, ptr); + if (ret2 < 0) + ret = ret2; + } else { + return -1; + } + } } else { ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr); } diff --git a/src/util/virpci.c b/src/util/virpci.c index 90617e69c6..6e6e5e47c0 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2478,6 +2478,74 @@ virPCIDeviceGetIOMMUGroupDev(virPCIDevice *dev) return g_strdup_printf("/dev/vfio/%s", groupFile); }
+/* virPCIDeviceGetIOMMUFDDev - return the name of the device used + * to control this PCI device's group (e.g. "/dev/vfio/devices/vfio15") + */ +char * +virPCIDeviceGetIOMMUFDDev(virPCIDevice *dev) +{ + g_autofree char *path = NULL; + const char *pci_addr = NULL; + g_autoptr(DIR) dir = NULL; + struct dirent *entry; + char *vfiodev = NULL; + + /* Get PCI device address */
No need for this kind of comment - it's obvious from the variable and function names.
+ pci_addr = virPCIDeviceGetName(dev); + if (!pci_addr) + return NULL; + + /* First try: look in PCI device's vfio-dev subdirectory */ + path = g_strdup_printf("/sys/bus/pci/devices/%s/vfio-dev", pci_addr); + + if (virDirOpen(&dir, path) == 1) { + while (virDirRead(dir, &entry, path) > 0) { + if (!g_str_has_prefix(entry->d_name, "vfio")) + continue; + + vfiodev = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name); + break; + } + /* g_autoptr will automatically close dir when it goes out of scope */
This comment is also obvious.
+ dir = NULL;
That does not make dir go out of scope. That's a memory leak. We try not to mix g_auto with manual freeing of variables, so either use two variables or two different scopes. Jano
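The "two different scopes" variant can be sketched without GLib by using the GCC/Clang cleanup attribute directly, which is, as far as I know, the mechanism g_autoptr builds on. AUTO_DIR, autoCloseDir() and openFirstAvailable() are names invented for this example. Each directory handle lives in its own block, so the cleanup handler closes it when that block ends and the pointer is never reset by hand.

```c
#include <dirent.h>
#include <stddef.h>

/* Cleanup callback invoked when an AUTO_DIR variable leaves its scope. */
static void
autoCloseDir(DIR **dirp)
{
    if (*dirp)
        closedir(*dirp);
}

/* GCC/Clang extension; a stand-in for g_autoptr(DIR) in this sketch. */
#define AUTO_DIR __attribute__((cleanup(autoCloseDir))) DIR *

/* Try two directories in order, each handle in its own scope.
 * Returns 0 if the first could be opened, 1 if only the second could,
 * -1 if neither could. */
int
openFirstAvailable(const char *first, const char *second)
{
    {
        AUTO_DIR dir = opendir(first);

        if (dir)
            return 0;   /* autoCloseDir() runs for 'dir' on this return */
    }

    {
        AUTO_DIR dir = opendir(second);

        if (dir)
            return 1;
    }

    return -1;
}
```

Overwriting an auto-cleaned pointer with NULL, as the quoted hunk does, leaves the handler with nothing to close, which is why the original handle leaks.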
+ } + + /* Second try: scan /sys/class/vfio-dev for matching device */ + if (!vfiodev) { + g_free(path); + path = g_strdup("/sys/class/vfio-dev"); +
On 11/21/2025 8:37 AM, Ján Tomko wrote:
On a Monday in 2025, Nathan Chen via Devel wrote:
+int +qemuSetupIommufdCgroup(virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + g_autofree char *path = NULL; + int iommufd = 0; + size_t i; + + for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } + + if (iommufd == 1) { + if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) + return 0; + if (virDirOpen(&dir, "/dev/vfio/devices") < 0) { + if (errno == ENOENT) + return 0; + return -1; + } + while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) { + if (STRPREFIX(dent->d_name, "vfio")) { + path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name); + } + if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1;
This allows all the devices instead of just the ones the VM needs.
Also, this is still a hostdev, so it should be done inside qemuSetupHostdevCgroup.
Do hostdevs using iommufd also need access to a) /dev/vfio/vfio and b) /dev/vfio/<iommugroup> which were already allowed in qemuSetupHostdevCgroup?
They don't need access to these. I'll add a check to avoid providing access to these if iommufd is specified.
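A minimal sketch of such a check, assuming a simplified hostdev with just an iommufd flag and a legacy group number: the helper returns the legacy VFIO nodes only for hostdevs that do not use iommufd. The Hostdev type, the hostdevLegacyPaths() name and the fixed buffer size are all made up for this example and do not reflect how qemuSetupHostdevCgroup is structured.

```c
#include <stdio.h>
#include <string.h>

#define LEGACY_PATH_LEN 32

/* Hypothetical, simplified view of a PCI hostdev. */
typedef struct {
    int iommufd;      /* non-zero when <driver iommufd='yes'/> is set */
    int iommuGroup;   /* legacy VFIO group number */
} Hostdev;

/* Fill 'paths' with the legacy VFIO nodes the device needs.
 * Returns the number of paths written: 2 for a legacy VFIO hostdev
 * (container node plus group node), 0 for an iommufd hostdev. */
int
hostdevLegacyPaths(const Hostdev *dev, char paths[2][LEGACY_PATH_LEN])
{
    if (dev->iommufd)
        return 0;

    /* Both strings fit: the literal is 15 bytes with its terminator,
     * and "/dev/vfio/" plus any int is at most 22 bytes. */
    strcpy(paths[0], "/dev/vfio/vfio");
    snprintf(paths[1], LEGACY_PATH_LEN, "/dev/vfio/%d", dev->iommuGroup);

    return 2;
}
```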
+ } + path = NULL; + }
+ if (virFileExists("/dev/iommu")) + path = g_strdup("/dev/iommu");
No need to check for the existence of the device. If it does not exist, the VM won't start anyway.
Also, is it necessary to allow these? libvirt already opened the files and passed file descriptors.
Thanks for catching this. I carried this cgroup and namespace logic over from earlier patches, which did not pass the file descriptors. I'll drop it in the next revision.
+ if (path && + qemuCgroupAllowDevicePath(vm, path, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } + } + return 0; +} + + /** * qemuSetupHostdevCgroup: * vm: domain object @@ -759,6 +807,7 @@ qemuSetupDevicesCgroup(virDomainObj *vm) g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); const char *const *deviceACL = (const char *const *) cfg->cgroupDeviceACL; int rv = -1; + int iommufd = 0; size_t i;
if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) @@ -836,6 +885,18 @@ qemuSetupDevicesCgroup(virDomainObj *vm) return -1; }
+ for (i = 0; i < vm->def->nhostdevs; i++) { + if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) { + iommufd = 1; + break; + } + } +
No need to check this upfront. If /dev/iommu access is necessary, the per-hostdev function qemuSetupHostdevCgroup can add it to the list multiple times, like it already does for /dev/vfio/vfio
That makes sense, I will remove this.
+    if (iommufd == 1) {
+        if (qemuSetupIommufdCgroup(vm) < 0)
+            return -1;
+    }
+
     for (i = 0; i < vm->def->nmems; i++) {
         if (qemuSetupMemoryDevicesCgroup(vm, vm->def->mems[i]) < 0)
             return -1;
diff --git a/src/qemu/qemu_cgroup.h b/src/qemu/qemu_cgroup.h
index 3668034cde..bea677ba3c 100644
--- a/src/qemu/qemu_cgroup.h
+++ b/src/qemu/qemu_cgroup.h
@@ -42,6 +42,7 @@ int qemuSetupHostdevCgroup(virDomainObj *vm,
 int qemuTeardownHostdevCgroup(virDomainObj *vm,
                               virDomainHostdevDef *dev)
     G_GNUC_WARN_UNUSED_RESULT;
+int qemuSetupIommufdCgroup(virDomainObj *vm);
 int qemuSetupMemoryDevicesCgroup(virDomainObj *vm,
                                  virDomainMemoryDef *mem);
 int qemuTeardownMemoryDevicesCgroup(virDomainObj *vm,
diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c
index 932777505b..80496f2f0f 100644
--- a/src/qemu/qemu_namespace.c
+++ b/src/qemu/qemu_namespace.c
@@ -683,6 +683,47 @@ qemuDomainSetupLaunchSecurity(virDomainObj *vm,
 }

+static int
+qemuDomainSetupIommufd(virDomainObj *vm,
+                       GSList **paths)
+{
+    g_autoptr(DIR) dir = NULL;
+    struct dirent *dent;
+    g_autofree char *path = NULL;
+    int iommufd = 0;
+    size_t i;
+
+    for (i = 0; i < vm->def->nhostdevs; i++) {
+        if (vm->def->hostdevs[i]->source.subsys.u.pci.driver.iommufd) {
+            iommufd = 1;
+            break;
+        }
+    }
+
+    /* Check if iommufd is enabled */
+    if (iommufd == 1) {
+        if (virDirOpen(&dir, "/dev/vfio/devices") < 0) {
+            if (errno == ENOENT)
+                return 0;
+            return -1;
+        }
+        while (virDirRead(dir, &dent, "/dev/vfio/devices") > 0) {
+            if (STRPREFIX(dent->d_name, "vfio")) {
+                path = g_strdup_printf("/dev/vfio/devices/%s", dent->d_name);
+                *paths = g_slist_prepend(*paths, g_steal_pointer(&path));
+            }
+        }
+        path = NULL;
+        if (virFileExists("/dev/iommu"))
+            path = g_strdup("/dev/iommu");
+        if (path)
+            *paths = g_slist_prepend(*paths, g_steal_pointer(&path));
Same comments as for cgroups apply here too.
Ok, I will update the namespace logic as well.
+    }
+
+    return 0;
+}
+
+
 static int
 qemuNamespaceMknodPaths(virDomainObj *vm,
                         GSList *paths,
@@ -706,6 +747,9 @@ qemuDomainBuildNamespace(virQEMUDriverConfig *cfg,
     if (qemuDomainSetupAllDisks(vm, &paths) < 0)
         return -1;

+    if (qemuDomainSetupIommufd(vm, &paths) < 0)
+        return -1;
+
     if (qemuDomainSetupAllHostdevs(vm, &paths) < 0)
         return -1;
diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c
index 68ac39611f..0a878fd205 100644
--- a/src/security/security_apparmor.c
+++ b/src/security/security_apparmor.c
@@ -856,6 +856,21 @@ AppArmorSetSecurityHostdevLabel(virSecurityManager *mgr,
             }
             ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr);
             VIR_FREE(vfioGroupDev);
+
+            if (dev->source.subsys.u.pci.driver.iommufd) {
+                g_autofree char *vfiofdDev = virPCIDeviceGetIOMMUFDDev(pci);
+                const char *iommufdDir = "/dev/iommu";
+                if (vfiofdDev) {
+                    int ret2 = AppArmorSetSecurityPCILabel(pci, vfiofdDev, ptr);
+                    if (ret2 < 0)
+                        ret = ret2;
+                    ret2 = AppArmorSetSecurityPCILabel(pci, iommufdDir, ptr);
+                    if (ret2 < 0)
+                        ret = ret2;
+                } else {
+                    return -1;
+                }
+            }
         } else {
             ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr);
         }
diff --git a/src/util/virpci.c b/src/util/virpci.c
index 90617e69c6..6e6e5e47c0 100644
--- a/src/util/virpci.c
+++ b/src/util/virpci.c
@@ -2478,6 +2478,74 @@ virPCIDeviceGetIOMMUGroupDev(virPCIDevice *dev)
     return g_strdup_printf("/dev/vfio/%s", groupFile);
 }

+/* virPCIDeviceGetIOMMUFDDev - return the name of the device used
+ * to control this PCI device's group (e.g. "/dev/vfio/devices/vfio15")
+ */
+char *
+virPCIDeviceGetIOMMUFDDev(virPCIDevice *dev)
+{
+    g_autofree char *path = NULL;
+    const char *pci_addr = NULL;
+    g_autoptr(DIR) dir = NULL;
+    struct dirent *entry;
+    char *vfiodev = NULL;
+
+    /* Get PCI device address */
No need for this kind of comment - it's obvious from the variable and function names.
+    pci_addr = virPCIDeviceGetName(dev);
+    if (!pci_addr)
+        return NULL;
+
+    /* First try: look in PCI device's vfio-dev subdirectory */
+    path = g_strdup_printf("/sys/bus/pci/devices/%s/vfio-dev", pci_addr);
+
+    if (virDirOpen(&dir, path) == 1) {
+        while (virDirRead(dir, &entry, path) > 0) {
+            if (!g_str_has_prefix(entry->d_name, "vfio"))
+                continue;
+
+            vfiodev = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name);
+            break;
+        }
+        /* g_autoptr will automatically close dir when it goes out of scope */
This comment is also obvious.
Yes agreed, I will exclude the obvious comments.
+        dir = NULL;
That does not make dir go out of scope. That's a memory leak.
We try not to mix g_auto with manual freeing of variables, so either use two variables or two different scopes.
I see, I will use two variables. Thanks, Nathan
Provide sample XML and CLI args for the iommufd XML schema for pc, q35, and virt machine types. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- .../iommufd-q35.x86_64-latest.args | 41 +++++++++++++ .../iommufd-q35.x86_64-latest.xml | 60 +++++++++++++++++++ tests/qemuxmlconfdata/iommufd-q35.xml | 38 ++++++++++++ .../iommufd-virt.aarch64-latest.args | 33 ++++++++++ .../iommufd-virt.aarch64-latest.xml | 34 +++++++++++ tests/qemuxmlconfdata/iommufd-virt.xml | 22 +++++++ .../iommufd.x86_64-latest.args | 35 +++++++++++ .../qemuxmlconfdata/iommufd.x86_64-latest.xml | 38 ++++++++++++ tests/qemuxmlconfdata/iommufd.xml | 30 ++++++++++ tests/qemuxmlconftest.c | 4 ++ 10 files changed, 335 insertions(+) create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd-q35.xml create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd-virt.xml create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd.xml diff --git a/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args new file mode 100644 index 0000000000..7d819e141b --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args @@ -0,0 +1,41 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-q35-test \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-q35-test/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-q35-test/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-q35-test/.config \ +/usr/bin/qemu-system-x86_64 \ +-name guest=q35-test,debug-threads=on \ +-S \ +-object 
'{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-q35-test/master-key.aes"}' \ +-machine q35,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=off \ +-accel tcg \ +-cpu qemu64 \ +-m size=2097152k \ +-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":2147483648}' \ +-overcommit mem-lock=off \ +-smp 2,sockets=2,cores=1,threads=1 \ +-uuid 11dbdcdd-4c3b-482b-8903-9bdb8c0a2774 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-device '{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \ +-device '{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}' \ +-device '{"driver":"qemu-xhci","id":"usb","bus":"pci.1","addr":"0x0"}' \ +-blockdev '{"driver":"host_device","filename":"/dev/HostVG/QEMUGuest1","node-name":"libvirt-1-storage","read-only":false}' \ +-device '{"driver":"ide-hd","bus":"ide.0","drive":"libvirt-1-storage","id":"sata0-0-0","bootindex":1}' \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-device '{"driver":"qxl-vga","id":"video0","max_outputs":1,"ram_size":67108864,"vram_size":33554432,"vram64_size_mb":0,"vgamem_mb":8,"bus":"pcie.0","addr":"0x1"}' \ +-global ICH9-LPC.noreboot=off \ +-watchdog-action reset \ +-object '{"qom-type":"iommufd","id":"iommufd0","fd":"-1"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.5","id":"hostdev0","iommufd":"iommufd0","fd":"0","bus":"pcie.0","addr":"0x3"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml new file mode 100644 index 0000000000..bb76252b61 --- /dev/null +++ 
b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml @@ -0,0 +1,60 @@ +<domain type='qemu'> + <name>q35-test</name> + <uuid>11dbdcdd-4c3b-482b-8903-9bdb8c0a2774</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='q35'>hvm</type> + <boot dev='hd'/> + </os> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>qemu64</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <disk type='block' device='disk'> + <driver name='qemu' type='raw'/> + <source dev='/dev/HostVG/QEMUGuest1'/> + <target dev='sda' bus='sata'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='pci' index='0' model='pcie-root'/> + <controller type='pci' index='1' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='1' port='0x10'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/> + </controller> + <controller type='pci' index='2' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='2' port='0x11'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/> + </controller> + <controller type='sata' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/> + </controller> + <controller type='usb' index='0' model='qemu-xhci'> + <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <audio id='1' type='none'/> + <video> + <model type='qxl' ram='65536' vram='32768' vgamem='8192' heads='1' primary='yes'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> + </video> + <hostdev 
mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <watchdog model='itco' action='reset'/> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd-q35.xml b/tests/qemuxmlconfdata/iommufd-q35.xml new file mode 100644 index 0000000000..f3c2269fb1 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-q35.xml @@ -0,0 +1,38 @@ +<domain type='qemu'> + <name>q35-test</name> + <uuid>11dbdcdd-4c3b-482b-8903-9bdb8c0a2774</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='q35'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <disk type='block' device='disk'> + <source dev='/dev/HostVG/QEMUGuest1'/> + <target dev='sda' bus='sata'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='pci' index='0' model='pcie-root'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <controller type='sata' index='0'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <video> + <model type='qxl' ram='65536' vram='32768' vgamem='8192' heads='1'/> + </video> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args new file mode 100644 index 
0000000000..dbfd395168 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args @@ -0,0 +1,33 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-foo \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-foo/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-foo/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-foo/.config \ +/usr/bin/qemu-system-aarch64 \ +-name guest=foo,debug-threads=on \ +-S \ +-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-foo/master-key.aes"}' \ +-machine virt,usb=off,gic-version=2,dump-guest-core=off,memory-backend=mach-virt.ram,acpi=off \ +-accel tcg \ +-cpu cortex-a15 \ +-m size=1048576k \ +-object '{"qom-type":"memory-backend-ram","id":"mach-virt.ram","size":1073741824}' \ +-overcommit mem-lock=off \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid 6ba7b810-9dad-11d1-80b4-00c04fd430c8 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-object '{"qom-type":"iommufd","id":"iommufd0","fd":"-1"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.5","id":"hostdev0","iommufd":"iommufd0","fd":"0","bus":"pcie.0","addr":"0x1"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml new file mode 100644 index 0000000000..97b6e1e1c7 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml @@ -0,0 +1,34 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>6ba7b810-9dad-11d1-80b4-00c04fd430c8</uuid> + <memory unit='KiB'>1048576</memory> + <currentMemory unit='KiB'>1048576</currentMemory> + <vcpu placement='static'>1</vcpu> + 
<os> + <type arch='aarch64' machine='virt'>hvm</type> + <boot dev='hd'/> + </os> + <features> + <gic version='2'/> + </features> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>cortex-a15</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-aarch64</emulator> + <controller type='pci' index='0' model='pcie-root'/> + <audio id='1' type='none'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> + </hostdev> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd-virt.xml b/tests/qemuxmlconfdata/iommufd-virt.xml new file mode 100644 index 0000000000..c0b9d643b4 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-virt.xml @@ -0,0 +1,22 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>6ba7b810-9dad-11d1-80b4-00c04fd430c8</uuid> + <memory unit='KiB'>1048576</memory> + <currentMemory unit='KiB'>1048576</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='aarch64' machine='virt'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-aarch64</emulator> + <controller type='pci' index='0' model='pcie-root'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> + </hostdev> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd.x86_64-latest.args b/tests/qemuxmlconfdata/iommufd.x86_64-latest.args new file mode 100644 index 0000000000..3130ba2e3a --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd.x86_64-latest.args @@ -0,0 
+1,35 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-foo \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-foo/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-foo/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-foo/.config \ +/usr/bin/qemu-system-x86_64 \ +-name guest=foo,debug-threads=on \ +-S \ +-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-foo/master-key.aes"}' \ +-machine pc,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=off \ +-accel tcg \ +-cpu qemu64 \ +-m size=2097152k \ +-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":2147483648}' \ +-overcommit mem-lock=off \ +-smp 2,sockets=2,cores=1,threads=1 \ +-uuid 3c7c30b5-7866-4b05-8a29-efebccba52a0 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-device '{"driver":"piix3-usb-uhci","id":"usb","bus":"pci.0","addr":"0x1.0x2"}' \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-object '{"qom-type":"iommufd","id":"iommufd0","fd":"-1"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.5","id":"hostdev0","iommufd":"iommufd0","fd":"0","bus":"pci.0","addr":"0x3"}' \ +-device '{"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.0","addr":"0x2"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/iommufd.x86_64-latest.xml b/tests/qemuxmlconfdata/iommufd.x86_64-latest.xml new file mode 100644 index 0000000000..2e8951aaf6 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd.x86_64-latest.xml @@ -0,0 +1,38 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>3c7c30b5-7866-4b05-8a29-efebccba52a0</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' 
cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>qemu64</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pci-root'/> + <controller type='usb' index='0' model='piix3-uhci'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <audio id='1' type='none'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </memballoon> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd.xml b/tests/qemuxmlconfdata/iommufd.xml new file mode 100644 index 0000000000..eb278414d2 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd.xml @@ -0,0 +1,30 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>3c7c30b5-7866-4b05-8a29-efebccba52a0</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pci-root'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' 
slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <controller type='usb' index='0'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxmlconftest.c b/tests/qemuxmlconftest.c index dd55c1ef28..7705aba7bf 100644 --- a/tests/qemuxmlconftest.c +++ b/tests/qemuxmlconftest.c @@ -3045,6 +3045,10 @@ mymain(void) DO_TEST_CAPS_LATEST_PARSE_ERROR("virtio-iommu-dma-translation"); DO_TEST_CAPS_LATEST("acpi-generic-initiator"); + DO_TEST_CAPS_LATEST("iommufd"); + DO_TEST_CAPS_LATEST("iommufd-q35"); + DO_TEST_CAPS_ARCH_LATEST("iommufd-virt", "aarch64"); + DO_TEST_CAPS_LATEST("cpu-hotplug-startup"); DO_TEST_CAPS_ARCH_LATEST_PARSE_ERROR("cpu-hotplug-granularity", "ppc64"); -- 2.43.0
On a Monday in 2025, Nathan Chen via Devel wrote:
Hi,
This series implements support for using iommufd to propagate DMA mappings to the kernel for VM-assigned host devices in a qemu VM.
We add a new 'iommufd' attribute for hostdev devices to be associated with the iommufd object.
For instance, specifying the iommufd object and associated hostdev in a VM definition:
<devices> ... <hostdev mode='subsystem' type='pci' managed='no'> <driver iommufd='yes'/> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <driver iommufd='yes'/> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> ... </devices>
Are there any hardware/kernel requirements, or something that needs to be done on the host? Even when I add the device to the vfio-pci driver, it does not create /dev/vfio/devices for me:

error: unsupported configuration: VFIO device /dev/vfio/devices/vfio0 not found - ensure device is bound to vfio-pci driver

Kernel: 6.17.6-300.fc43.x86_64 on Fedora
QEMU: v10.1.0-2147-g917ac07f9a (the current master)

Also, I'd expect it to just work with managed='yes'.

Jano
On 11/6/2025 10:14 AM, Ján Tomko wrote:
Are there any hardware/kernel requirements, or something that needs to be done on the host? Even when I add the device to the vfio-pci driver, it does not create /dev/vfio/devices for me:
error: unsupported configuration: VFIO device /dev/vfio/devices/vfio0 not found - ensure device is bound to vfio-pci driver
Kernel: 6.17.6-300.fc43.x86_64 on Fedora QEMU: v10.1.0-2147-g917ac07f9a (the current master)
Also, I'd expect it to just work with managed='yes'.
The iommufd module should be loaded before adding the device to the vfio-pci driver:

# lsmod | grep iommufd
iommufd 327680 1 vfio

I will ensure the iommufd module gets loaded with managed='yes' in the next revision, thanks for catching that.

-Nathan
On a Thursday in 2025, Nathan Chen wrote:
On 11/6/2025 10:14 AM, Ján Tomko wrote:
Are there any hardware/kernel requirements, or something that needs to be done on the host? Even when I add the device to the vfio-pci driver, it does not create /dev/vfio/devices for me:
error: unsupported configuration: VFIO device /dev/vfio/devices/vfio0 not found - ensure device is bound to vfio-pci driver
Kernel: 6.17.6-300.fc43.x86_64 on Fedora
The answer was that the Fedora kernel did not have CONFIG_VFIO_DEVICE_CDEV=y enabled.

Jano
On 11/21/2025 7:14 AM, Ján Tomko wrote:
On a Thursday in 2025, Nathan Chen wrote:
On 11/6/2025 10:14 AM, Ján Tomko wrote:
Are there any hardware/kernel requirements, or something that needs to be done on the host? Even when I add the device to the vfio-pci driver, it does not create /dev/vfio/devices for me:
error: unsupported configuration: VFIO device /dev/vfio/devices/vfio0 not found - ensure device is bound to vfio-pci driver
Kernel: 6.17.6-300.fc43.x86_64 on Fedora
The answer was that the Fedora kernel did not have CONFIG_VFIO_DEVICE_CDEV=y enabled.
Jano
That makes sense - loading the VFIO driver should already load the iommufd module. I will avoid including any changes in the 'managed' hostdev logic for iommufd, especially because we are not binding/unbinding hostdevs to the iommufd module.

Thanks,
Nathan
participants (4)
- Andrea Bolognani
- Ján Tomko
- Laine Stump
- Nathan Chen