[PATCH v2 0/5] qemu: Implement support for iommufd
Hi, This is a follow up to the first patch series [0] for using iommufd to propagate DMA mappings to the kernel for VM-assigned host devices in a qemu VM. We add a new 'iommufd' attribute for hostdev devices to be associated with the iommufd object. For instance, specifying the iommufd object and associated hostdev in a VM definition: <devices> ... <hostdev mode='subsystem' type='pci' managed='no'> <driver iommufd='yes'/> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x15' slot='0x00' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='no'> <driver iommufd='yes'/> <source> <address domain='0x0019' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x16' slot='0x00' function='0x0'/> </hostdev> ... </devices> This would get translated to a qemu command line with the arguments below. Note that libvirt will open the /dev/iommu and VFIO cdev, passing the associated fd number to qemu: -object '{"qom-type":"iommufd","id":"iommufd0","fd":"24"}' \ -device '{"driver":"vfio-pci","host":"0009:01:00.0","id":"hostdev0","iommufd":"iommufd0","fd":"22","bus":"pci.21","addr":"0x0"}' \ -device '{"driver":"vfio-pci","host":"0019:01:00.0","id":"hostdev1","iommufd":"iommufd0","fd":"25","bus":"pci.22","addr":"0x0"}' \ Note that when Libvirt launches the qemu process as an unprivileged user, i.e. libvirt-qemu, it is bound by the RLIMIT_MEMLOCK limit calculated based on how much RAM is assigned to the VM. IOMMUFD's unified memory accounting causes ENOMEM errors for large device BARs that exceed the process memlock limit. VFIO without IOMMUFD does not check device memory against the limit and therefore does not show this behavior. To work around the issue, we can specify CAP_IPC_LOCK=ep capability for the qemu binary to bypass the RLIMIT_MEMLOCK limit. I am working on a long-term fix in the kernel to mark device memory regions so they do not count against the RLIMIT_MEMLOCK limit for how much memory can be locked/pinned, as I/O device BARs do not consume host RAM. This series is on Github: https://github.com/NathanChenNVIDIA/libvirt/tree/iommufd-11-25 Thanks, Nathan [0] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/MYPM5... Signed-off-by: Nathan Chen <nathanc@nvidia.com> Nathan Chen (5): qemu: Implement support for associating iommufd to hostdev qemu: open VFIO FDs from libvirt backend qemu: open iommufd FD from libvirt backend qemu: Update Cgroup, namespace, and seclabel for iommufd tests: qemuxmlconfdata: provide iommufd sample XML and CLI args docs/formatdomain.rst | 8 + src/conf/device_conf.c | 12 ++ src/conf/device_conf.h | 1 + src/conf/domain_conf.h | 2 + src/conf/schemas/basictypes.rng | 5 + src/libvirt_private.syms | 1 + src/qemu/qemu_cgroup.c | 26 +-- src/qemu/qemu_command.c | 53 ++++- src/qemu/qemu_domain.c | 40 ++++ src/qemu/qemu_domain.h | 19 ++ src/qemu/qemu_namespace.c | 16 +- src/qemu/qemu_process.c | 186 ++++++++++++++++++ src/security/security_apparmor.c | 18 +- src/security/security_dac.c | 28 ++- src/security/security_selinux.c | 28 ++- src/security/virt-aa-helper.c | 11 +- src/util/virpci.c | 69 +++++++ src/util/virpci.h | 2 + .../iommufd-q35.x86_64-latest.args | 41 ++++ .../iommufd-q35.x86_64-latest.xml | 60 ++++++ tests/qemuxmlconfdata/iommufd-q35.xml | 38 ++++ .../iommufd-virt.aarch64-latest.args | 33 ++++ .../iommufd-virt.aarch64-latest.xml | 34 ++++ tests/qemuxmlconfdata/iommufd-virt.xml | 22 +++ .../iommufd.x86_64-latest.args | 35 ++++ .../qemuxmlconfdata/iommufd.x86_64-latest.xml | 38 ++++ tests/qemuxmlconfdata/iommufd.xml | 30 +++ tests/qemuxmlconftest.c | 33 ++++ 28 files changed, 839 insertions(+), 50 deletions(-) create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd-q35.xml create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd-virt.xml create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd.xml -- 2.43.0
Implement a new iommufd attribute under hostdevs' PCI subsystem driver that can be used to specify associated iommufd object when launching a qemu VM. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 8 ++++++++ src/conf/device_conf.c | 12 ++++++++++++ src/conf/device_conf.h | 1 + src/conf/schemas/basictypes.rng | 5 +++++ src/qemu/qemu_command.c | 19 +++++++++++++++++++ 5 files changed, 45 insertions(+) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 160e7ad9c7..dcb24b1b23 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4862,6 +4862,7 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. + ``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on @@ -4902,6 +4903,13 @@ or: found is "problematic" in some way, the generic vfio-pci driver similarly be forced. + The ``<driver>`` element's ``iommufd`` attribute is used to specify + using the iommufd interface to propagate DMA mappings to the kernel, + instead of VFIO alone. When the attribute is present, an iommufd + object will be created by the resulting qemu command. Libvirt will + open the /dev/iommu and VFIO device cdev, passing the associated + file descriptor numbers to the qemu command. + (Note: :since:`Since 1.0.5`, the ``name`` attribute has been described to be used to select the type of PCI device assignment ("vfio", "kvm", or "xen"), but those values have been mostly diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index c278b81652..7682236d65 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -60,6 +60,8 @@ int virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, virDeviceHostdevPCIDriverInfo *driver) { + virTristateBool iommufd; + driver->iommufd = false; if (virXMLPropEnum(node, "name", virDeviceHostdevPCIDriverNameTypeFromString, VIR_XML_PROP_NONZERO, @@ -67,6 +69,10 @@ virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, return -1; } + if (virXMLPropTristateBool(node, "iommufd", VIR_XML_PROP_NONE, &iommufd) < 0) + return -1; + driver->iommufd = iommufd; + driver->model = virXMLPropString(node, "model"); return 0; } @@ -93,6 +99,12 @@ virDeviceHostdevPCIDriverInfoFormat(virBuffer *buf, virBufferEscapeString(&driverAttrBuf, " model='%s'", driver->model); + if (driver->iommufd == VIR_TRISTATE_BOOL_YES) { + virBufferAddLit(&driverAttrBuf, " iommufd='yes'"); + } else if (driver->iommufd == VIR_TRISTATE_BOOL_NO) { + virBufferAddLit(&driverAttrBuf, " iommufd='no'"); + } + virXMLFormatElement(buf, "driver", &driverAttrBuf, NULL); return 0; } diff --git a/src/conf/device_conf.h b/src/conf/device_conf.h index e570f51824..116b959143 100644 --- a/src/conf/device_conf.h +++ b/src/conf/device_conf.h @@ -47,6 +47,7 @@ VIR_ENUM_DECL(virDeviceHostdevPCIDriverName); struct _virDeviceHostdevPCIDriverInfo { virDeviceHostdevPCIDriverName name; char *model; + virTristateBool iommufd; }; typedef enum { diff --git a/src/conf/schemas/basictypes.rng b/src/conf/schemas/basictypes.rng index 2931e316b7..089fc0f1c2 100644 --- a/src/conf/schemas/basictypes.rng +++ b/src/conf/schemas/basictypes.rng @@ -673,6 +673,11 @@ <ref name="genericName"/> </attribute> </optional> + <optional> + <attribute name="iommufd"> + <ref name="virYesNo"/> + </attribute> + </optional> <empty/> </element> </define> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 5a834ef842..95d1c2ee98 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4753,6 +4753,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, g_autofree char *host = virPCIDeviceAddressAsString(&pcisrc->addr); const char *failover_pair_id = NULL; const char *driver = NULL; + const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; @@ -4786,6 +4787,9 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, teaming->persistent) failover_pair_id = teaming->persistent; + if (pcisrc->driver.iommufd == VIR_TRISTATE_BOOL_YES) + iommufdId = "iommufd0"; + if (virJSONValueObjectAdd(&props, "s:driver", driver, "s:host", host, @@ -4794,6 +4798,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, "S:failover_pair_id", failover_pair_id, "S:display", qemuOnOffAuto(pcisrc->display), "B:ramfb", ramfb, + "S:iommufd", iommufdId, NULL) < 0) return NULL; @@ -5210,6 +5215,9 @@ qemuBuildHostdevCommandLine(virCommand *cmd, virQEMUCaps *qemuCaps) { size_t i; + g_autoptr(virJSONValue) props = NULL; + int iommufd = 0; + const char * iommufdId = "iommufd0"; for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5238,6 +5246,17 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (hostdev->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) continue; + if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES && iommufd == 0) { + iommufd = 1; + if (qemuMonitorCreateObjectProps(&props, "iommufd", + iommufdId, + NULL) < 0) + return -1; + + if (qemuBuildObjectCommandlineFromJSON(cmd, props) < 0) + return -1; + } + if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1; -- 2.43.0
On a Friday in 2025, Nathan Chen via Devel wrote: To get rid of the "via Devel" part in git commit's Author: field, set the format.from and format.forceInBodyFrom fields in your gitconfig: https://libvirt.org/submitting-patches.html#git-configuration
Implement a new iommufd attribute under hostdevs' PCI subsystem driver that can be used to specify associated iommufd object when launching a qemu VM.
Also, I've applied most of the stuff I say here in my branch, mostly to see if they are possible: https://gitlab.com/janotomko/libvirt/-/tree/iommufd?ref_type=heads git fetch https://gitlab.com/janotomko/libvirt.git iommufd
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 8 ++++++++ src/conf/device_conf.c | 12 ++++++++++++ src/conf/device_conf.h | 1 + src/conf/schemas/basictypes.rng | 5 +++++ src/qemu/qemu_command.c | 19 +++++++++++++++++++ 5 files changed, 45 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 160e7ad9c7..dcb24b1b23 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4862,6 +4862,7 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. +
Unrelated and unnecessary whitespace change (lots of others in the file do not put a space between attributes)
``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on @@ -4902,6 +4903,13 @@ or: found is "problematic" in some way, the generic vfio-pci driver similarly be forced.
+ The ``<driver>`` element's ``iommufd`` attribute is used to specify + using the iommufd interface to propagate DMA mappings to the kernel, + instead of VFIO alone. When the attribute is present, an iommufd + object will be created by the resulting qemu command. Libvirt will + open the /dev/iommu and VFIO device cdev, passing the associated + file descriptor numbers to the qemu command. + (Note: :since:`Since 1.0.5`, the ``name`` attribute has been described to be used to select the type of PCI device assignment ("vfio", "kvm", or "xen"), but those values have been mostly diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index c278b81652..7682236d65 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -60,6 +60,8 @@ int virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, virDeviceHostdevPCIDriverInfo *driver) { + virTristateBool iommufd;
This intermediary variable is not needed
+ driver->iommufd = false;
and this is not a bool variable.
if (virXMLPropEnum(node, "name", virDeviceHostdevPCIDriverNameTypeFromString, VIR_XML_PROP_NONZERO, @@ -67,6 +69,10 @@ virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, return -1; }
+ if (virXMLPropTristateBool(node, "iommufd", VIR_XML_PROP_NONE, &iommufd) < 0) + return -1;
Just let virXMLPropTristateBool handle the default and write into driver->iommufd directly.
+ driver->iommufd = iommufd; + driver->model = virXMLPropString(node, "model"); return 0; } @@ -93,6 +99,12 @@ virDeviceHostdevPCIDriverInfoFormat(virBuffer *buf,
virBufferEscapeString(&driverAttrBuf, " model='%s'", driver->model);
+ if (driver->iommufd == VIR_TRISTATE_BOOL_YES) { + virBufferAddLit(&driverAttrBuf, " iommufd='yes'"); + } else if (driver->iommufd == VIR_TRISTATE_BOOL_NO) { + virBufferAddLit(&driverAttrBuf, " iommufd='no'"); + } + virXMLFormatElement(buf, "driver", &driverAttrBuf, NULL); return 0; } diff --git a/src/conf/device_conf.h b/src/conf/device_conf.h index e570f51824..116b959143 100644 --- a/src/conf/device_conf.h +++ b/src/conf/device_conf.h @@ -47,6 +47,7 @@ VIR_ENUM_DECL(virDeviceHostdevPCIDriverName); struct _virDeviceHostdevPCIDriverInfo { virDeviceHostdevPCIDriverName name; char *model; + virTristateBool iommufd; };
typedef enum { diff --git a/src/conf/schemas/basictypes.rng b/src/conf/schemas/basictypes.rng index 2931e316b7..089fc0f1c2 100644 --- a/src/conf/schemas/basictypes.rng +++ b/src/conf/schemas/basictypes.rng @@ -673,6 +673,11 @@ <ref name="genericName"/> </attribute> </optional> + <optional> + <attribute name="iommufd"> + <ref name="virYesNo"/> + </attribute> + </optional> <empty/> </element> </define> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 5a834ef842..95d1c2ee98 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4753,6 +4753,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, g_autofree char *host = virPCIDeviceAddressAsString(&pcisrc->addr); const char *failover_pair_id = NULL; const char *driver = NULL; + const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON;
@@ -4786,6 +4787,9 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, teaming->persistent) failover_pair_id = teaming->persistent;
+ if (pcisrc->driver.iommufd == VIR_TRISTATE_BOOL_YES) + iommufdId = "iommufd0"; + if (virJSONValueObjectAdd(&props, "s:driver", driver, "s:host", host, @@ -4794,6 +4798,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, "S:failover_pair_id", failover_pair_id, "S:display", qemuOnOffAuto(pcisrc->display), "B:ramfb", ramfb, + "S:iommufd", iommufdId, NULL) < 0) return NULL;
@@ -5210,6 +5215,9 @@ qemuBuildHostdevCommandLine(virCommand *cmd, virQEMUCaps *qemuCaps) { size_t i; + g_autoptr(virJSONValue) props = NULL; + int iommufd = 0; + const char * iommufdId = "iommufd0";
for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5238,6 +5246,17 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (hostdev->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) continue;
+ if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES && iommufd == 0) { + iommufd = 1; + if (qemuMonitorCreateObjectProps(&props, "iommufd", + iommufdId, + NULL) < 0) + return -1;
Formatting the iommufd object should be in a separate function, e.g. qemuBuildIOMMUFDCommandLine, not here in qemuBuildHostdevCommandLine. Moving it out also fixes the memory leak caused by reusal of 'props' without freeing them. Jano
+ + if (qemuBuildObjectCommandlineFromJSON(cmd, props) < 0) + return -1; + } + if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1;
-- 2.43.0
On 12/8/2025 9:58 AM, Ján Tomko wrote:
On a Friday in 2025, Nathan Chen via Devel wrote:
To get rid of the "via Devel" part in git commit's Author: field, set the format.from and format.forceInBodyFrom fields in your gitconfig:
https://libvirt.org/submitting-patches.html#git-configuration
Thanks, I will apply this configuration for future patch submissions.
Implement a new iommufd attribute under hostdevs' PCI subsystem driver that can be used to specify associated iommufd object when launching a qemu VM.
Also, I've applied most of the stuff I say here in my branch, mostly to see if they are possible: https://gitlab.com/janotomko/libvirt/-/tree/iommufd?ref_type=heads git fetch https://gitlab.com/janotomko/libvirt.git iommufd
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 8 ++++++++ src/conf/device_conf.c | 12 ++++++++++++ src/conf/device_conf.h | 1 + src/conf/schemas/basictypes.rng | 5 +++++ src/qemu/qemu_command.c | 19 +++++++++++++++++++ 5 files changed, 45 insertions(+)
diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 160e7ad9c7..dcb24b1b23 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -4862,6 +4862,7 @@ or: device; if PCI ROM loading is disabled through this attribute, attempts to tweak the loading process further using the ``bar`` or ``file`` attributes will be rejected. :since:`Since 4.3.0 (QEMU and KVM only)`. +
Unrelated and unnecessary whitespace change (lots of others in the file do not put a space between attributes)
Got it, I will clean up unnecessary whitespaces in the next revision.
``address`` The ``address`` element for USB devices has a ``bus`` and ``device`` attribute to specify the USB bus and device number the device appears at on @@ -4902,6 +4903,13 @@ or: found is "problematic" in some way, the generic vfio-pci driver similarly be forced.
+ The ``<driver>`` element's ``iommufd`` attribute is used to specify + using the iommufd interface to propagate DMA mappings to the kernel, + instead of VFIO alone. When the attribute is present, an iommufd + object will be created by the resulting qemu command. Libvirt will + open the /dev/iommu and VFIO device cdev, passing the associated + file descriptor numbers to the qemu command. + (Note: :since:`Since 1.0.5`, the ``name`` attribute has been described to be used to select the type of PCI device assignment ("vfio", "kvm", or "xen"), but those values have been mostly diff --git a/src/conf/device_conf.c b/src/conf/device_conf.c index c278b81652..7682236d65 100644 --- a/src/conf/device_conf.c +++ b/src/conf/device_conf.c @@ -60,6 +60,8 @@ int virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, virDeviceHostdevPCIDriverInfo *driver) { + virTristateBool iommufd;
This intermediary variable is not needed
+ driver->iommufd = false;
and this is not a bool variable.
if (virXMLPropEnum(node, "name", virDeviceHostdevPCIDriverNameTypeFromString, VIR_XML_PROP_NONZERO, @@ -67,6 +69,10 @@ virDeviceHostdevPCIDriverInfoParseXML(xmlNodePtr node, return -1; }
+ if (virXMLPropTristateBool(node, "iommufd", VIR_XML_PROP_NONE, &iommufd) < 0) + return -1;
Just let virXMLPropTristateBool handle the default and write into driver->iommufd directly.
Ok, I will fix these three in the next revision.
+ driver->iommufd = iommufd; + driver->model = virXMLPropString(node, "model"); return 0; } @@ -93,6 +99,12 @@ virDeviceHostdevPCIDriverInfoFormat(virBuffer *buf,
virBufferEscapeString(&driverAttrBuf, " model='%s'", driver->model);
+ if (driver->iommufd == VIR_TRISTATE_BOOL_YES) { + virBufferAddLit(&driverAttrBuf, " iommufd='yes'"); + } else if (driver->iommufd == VIR_TRISTATE_BOOL_NO) { + virBufferAddLit(&driverAttrBuf, " iommufd='no'"); + } + virXMLFormatElement(buf, "driver", &driverAttrBuf, NULL); return 0; } diff --git a/src/conf/device_conf.h b/src/conf/device_conf.h index e570f51824..116b959143 100644 --- a/src/conf/device_conf.h +++ b/src/conf/device_conf.h @@ -47,6 +47,7 @@ VIR_ENUM_DECL(virDeviceHostdevPCIDriverName); struct _virDeviceHostdevPCIDriverInfo { virDeviceHostdevPCIDriverName name; char *model; + virTristateBool iommufd; };
typedef enum { diff --git a/src/conf/schemas/basictypes.rng b/src/conf/schemas/ basictypes.rng index 2931e316b7..089fc0f1c2 100644 --- a/src/conf/schemas/basictypes.rng +++ b/src/conf/schemas/basictypes.rng @@ -673,6 +673,11 @@ <ref name="genericName"/> </attribute> </optional> + <optional> + <attribute name="iommufd"> + <ref name="virYesNo"/> + </attribute> + </optional> <empty/> </element> </define> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 5a834ef842..95d1c2ee98 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4753,6 +4753,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, g_autofree char *host = virPCIDeviceAddressAsString(&pcisrc->addr); const char *failover_pair_id = NULL; const char *driver = NULL; + const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON;
@@ -4786,6 +4787,9 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, teaming->persistent) failover_pair_id = teaming->persistent;
+ if (pcisrc->driver.iommufd == VIR_TRISTATE_BOOL_YES) + iommufdId = "iommufd0"; + if (virJSONValueObjectAdd(&props, "s:driver", driver, "s:host", host, @@ -4794,6 +4798,7 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, "S:failover_pair_id", failover_pair_id, "S:display", qemuOnOffAuto(pcisrc-
display), "B:ramfb", ramfb, + "S:iommufd", iommufdId, NULL) < 0) return NULL;
@@ -5210,6 +5215,9 @@ qemuBuildHostdevCommandLine(virCommand *cmd, virQEMUCaps *qemuCaps) { size_t i; + g_autoptr(virJSONValue) props = NULL; + int iommufd = 0; + const char * iommufdId = "iommufd0";
for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5238,6 +5246,17 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (hostdev->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) continue;
+ if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES && iommufd == 0) { + iommufd = 1; + if (qemuMonitorCreateObjectProps(&props, "iommufd", + iommufdId, + NULL) < 0) + return -1;
Formatting the iommufd object should be in a separate function, e.g. qemuBuildIOMMUFDCommandLine, not here in qemuBuildHostdevCommandLine.
Moving it out also fixes the memory leak caused by reusal of 'props' without freeing them.
Ok, I will separate this out. Thanks, Nathan
Hi Jano, On 12/8/2025 9:58 AM, Ján Tomko wrote:
On a Friday in 2025, Nathan Chen via Devel wrote:
Also, I've applied most of the stuff I say here in my branch, mostly to see if they are possible: https://gitlab.com/janotomko/libvirt/-/tree/iommufd?ref_type=heads git fetch https://gitlab.com/janotomko/libvirt.git iommufd
I have taken parts of your code here for the next revision as they are cleaner than what I came up with. Would it be more correct to put "Signed-off-by: Ján Tomko <jtomko@redhat.com>" or "Suggested-by: Ján Tomko <jtomko@redhat.com>" preceding my sign-off in the commit descriptions? Thanks, Nathan
On a Wednesday in 2025, Nathan Chen wrote:
Hi Jano,
On 12/8/2025 9:58 AM, Ján Tomko wrote:
On a Friday in 2025, Nathan Chen via Devel wrote:
Also, I've applied most of the stuff I say here in my branch, mostly to see if they are possible: https://gitlab.com/janotomko/libvirt/-/tree/iommufd?ref_type=heads git fetch https://gitlab.com/janotomko/libvirt.git iommufd
I have taken parts of your code here for the next revision as they are cleaner than what I came up with. Would it be more correct to put "Signed-off-by: Ján Tomko <jtomko@redhat.com>" or "Suggested-by: Ján Tomko <jtomko@redhat.com>" preceding my sign-off in the commit descriptions?
Depends on how big the code change is. Personally, I would preserve the signoff for longer stuff and use "Suggested-by" for shorter changes. Also, I add my own sign-off when I edit something in a patch before merging, to indicate that possible errors might not have come from just the author of the patch, but also could be mine :) Jano
Thanks, Nathan
Open VFIO FDs from libvirt backend without exposing these FDs to XML users, i.e. one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.h | 2 + src/libvirt_private.syms | 1 + src/qemu/qemu_command.c | 26 ++++++++ src/qemu/qemu_domain.c | 39 ++++++++++++ src/qemu/qemu_domain.h | 17 +++++ src/qemu/qemu_process.c | 130 +++++++++++++++++++++++++++++++++++++++ src/util/virpci.c | 69 +++++++++++++++++++++ src/util/virpci.h | 2 + 8 files changed, 286 insertions(+) diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 4fd8342950..da4ce9fc86 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -364,6 +364,8 @@ struct _virDomainHostdevDef { */ virDomainNetDef *parentnet; + virObject *privateData; + virDomainHostdevMode mode; virDomainStartupPolicy startupPolicy; bool managed; diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 4e57e4a8f6..ed2b0d381e 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3159,6 +3159,7 @@ virPCIDeviceGetStubDriverName; virPCIDeviceGetStubDriverType; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; +virPCIDeviceGetVfioPath; virPCIDeviceGetVPD; virPCIDeviceHasPCIExpressLink; virPCIDeviceIsAssignable; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 95d1c2ee98..9b08f66175 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4756,6 +4756,12 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + pcisrc->driver.iommufd == VIR_TRISTATE_BOOL_YES) { + useIommufd = true; + } /* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4802,6 +4808,17 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL; + if (useIommufd && dev->privateData) { + qemuDomainHostdevPrivate *hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(dev); + + if (hostdevPriv->vfioDeviceFd >= 0) { + if (virJSONValueObjectAdd(&props, + "S:fd", g_strdup_printf("%d", hostdevPriv->vfioDeviceFd), + NULL) < 0) + return NULL; + } + } + if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0) return NULL; @@ -5260,6 +5277,15 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1; + if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { + qemuDomainHostdevPrivate *hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + if (hostdevPriv && hostdevPriv->vfioDeviceFd >= 0) { + virCommandPassFD(cmd, hostdevPriv->vfioDeviceFd, + VIR_COMMAND_PASS_FD_CLOSE_PARENT); + } + } + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev))) return -1; diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index ac56fc7cb4..7601bdbb2b 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1238,6 +1238,45 @@ qemuDomainNetworkPrivateFormat(const virDomainNetDef *net, } +static virClass *qemuDomainHostdevPrivateClass; + +static void +qemuDomainHostdevPrivateDispose(void *obj) +{ + qemuDomainHostdevPrivate *priv = obj; + + VIR_FORCE_CLOSE(priv->vfioDeviceFd); +} + + +static int +qemuDomainHostdevPrivateOnceInit(void) +{ + if (!VIR_CLASS_NEW(qemuDomainHostdevPrivate, virClassForObject())) + return -1; + + return 0; +} + +VIR_ONCE_GLOBAL_INIT(qemuDomainHostdevPrivate); + +virObject * +qemuDomainHostdevPrivateNew(void) +{ + qemuDomainHostdevPrivate *priv; + + if (qemuDomainHostdevPrivateInitialize() < 0) + return NULL; + + if (!(priv = virObjectNew(qemuDomainHostdevPrivateClass))) + return NULL; + + priv->vfioDeviceFd = -1; + + return (virObject *) priv; +} + + /* qemuDomainSecretInfoSetup: * @priv: pointer to domain private object * @alias: alias of the secret diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 3396f929fd..4736f1ede5 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -461,6 +461,17 @@ struct _qemuDomainTPMPrivate { }; +#define QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev) \ + ((qemuDomainHostdevPrivate *) (hostdev)->privateData) + +typedef struct _qemuDomainHostdevPrivate qemuDomainHostdevPrivate; +struct _qemuDomainHostdevPrivate { + virObject parent; + + /* VFIO device file descriptor for iommufd passthrough */ + int vfioDeviceFd; +}; + void qemuDomainNetworkPrivateClearFDs(qemuDomainNetworkPrivate *priv); @@ -1174,3 +1185,9 @@ qemuDomainCheckCPU(virArch arch, bool qemuDomainMachineSupportsFloppy(const char *machine, virQEMUCaps *qemuCaps); + +virObject * +qemuDomainHostdevPrivateNew(void); + +int qemuProcessOpenVfioFds(virDomainObj *vm); +void qemuProcessCloseVfioFds(virDomainObj *vm); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 45fc32a663..bf245ee8af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -106,6 +106,7 @@ #include "logging/log_manager.h" #include "logging/log_protocol.h" +#include "util/virpci.h" #define VIR_FROM_THIS VIR_FROM_QEMU @@ -8091,6 +8092,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuExtDevicesStart(driver, vm, incomingMigrationExtDevices) < 0) goto cleanup; + if (qemuProcessOpenVfioFds(vm) < 0) + goto cleanup; + if (!(cmd = qemuBuildCommandLine(vm, incoming ? "defer" : NULL, vmop, @@ -10267,3 +10271,129 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, qemuProcessEventSubmit(vm, QEMU_PROCESS_EVENT_NBDKIT_EXITED, 0, 0, nbdkit); virObjectUnlock(vm); } + +/** + * qemuProcessOpenVfioDeviceFd: + * @hostdev: host device definition + * @vfioFd: returned file descriptor + * + * Opens the VFIO device file descriptor for a hostdev. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev, + int *vfioFd) +{ + g_autofree char *vfioPath = NULL; + int fd = -1; + + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || + hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("VFIO FD only supported for PCI hostdevs")); + return -1; + } + + if (virPCIDeviceGetVfioPath(&hostdev->source.subsys.u.pci.addr, &vfioPath) < 0) + return -1; + + VIR_DEBUG("Opening VFIO device %s", vfioPath); + + if ((fd = open(vfioPath, O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("VFIO device %1$s not found - ensure device is bound to vfio-pci driver"), + vfioPath); + } else { + virReportSystemError(errno, + _("cannot open VFIO device %1$s"), vfioPath); + } + return -1; + } + + *vfioFd = fd; + VIR_DEBUG("Opened VFIO device FD %d for %s", *vfioFd, vfioPath); + return 0; +} + +/** + * qemuProcessOpenVfioFds: + * @vm: domain object + * + * Opens all necessary VFIO file descriptors for the domain. + * + * Returns: 0 on success, -1 on failure + */ +int +qemuProcessOpenVfioFds(virDomainObj *vm) +{ + size_t i; + + /* Check if we have any hostdevs that need VFIO FDs */ + for (i = 0; i < vm->def->nhostdevs; i++) { + virDomainHostdevDef *hostdev = vm->def->hostdevs[i]; + qemuDomainHostdevPrivate *hostdevPriv = NULL; + + if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + + if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + hostdev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { + + if (!hostdev->privateData) { + if (!(hostdev->privateData = qemuDomainHostdevPrivateNew())) + goto error; + } + + hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + /* Open VFIO device FD */ + if (qemuProcessOpenVfioDeviceFd(hostdev, &hostdevPriv->vfioDeviceFd) < 0) + goto error; + + VIR_DEBUG("Stored VFIO FD %d in hostdev %04x:%02x:%02x.%d private data", + hostdevPriv->vfioDeviceFd, + hostdev->source.subsys.u.pci.addr.domain, + hostdev->source.subsys.u.pci.addr.bus, + hostdev->source.subsys.u.pci.addr.slot, + hostdev->source.subsys.u.pci.addr.function); + } + } + } + + return 0; + + error: + qemuProcessCloseVfioFds(vm); + return -1; +} + +/** + * qemuProcessCloseVfioFds: + * @vm: domain object + * + * Closes all VFIO file descriptors for the domain. + */ +void +qemuProcessCloseVfioFds(virDomainObj *vm) +{ + size_t i; + + /* Close all VFIO device FDs */ + for (i = 0; i < vm->def->nhostdevs; i++) { + virDomainHostdevDef *hostdev = vm->def->hostdevs[i]; + qemuDomainHostdevPrivate *hostdevPriv; + + if (!hostdev->privateData) + continue; + + hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + if (hostdevPriv->vfioDeviceFd >= 0) { + VIR_DEBUG("Closing VFIO device FD %d", hostdevPriv->vfioDeviceFd); + VIR_FORCE_CLOSE(hostdevPriv->vfioDeviceFd); + } + } +} diff --git a/src/util/virpci.c b/src/util/virpci.c index 90617e69c6..da62ece0f6 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -3320,3 +3320,72 @@ virPCIDeviceAddressFree(virPCIDeviceAddress *address) { g_free(address); } + +/** + * virPCIDeviceGetVfioPath: + * @addr: host device PCI address + * @vfioPath: returned VFIO device path + * + * Constructs the VFIO device path for a PCI hostdev. + * + * Returns: 0 on success, -1 on failure + */ +int +virPCIDeviceGetVfioPath(virPCIDeviceAddress *addr, + char **vfioPath) +{ + g_autofree char *addrStr = NULL; + + *vfioPath = NULL; + addrStr = virPCIDeviceAddressAsString(addr); + + /* First try: Direct lookup in device's vfio-dev subdirectory */ + { + g_autofree char *sysfsPath = NULL; + g_autoptr(DIR) dir = NULL; + struct dirent *entry = NULL; + + sysfsPath = g_strdup_printf("/sys/bus/pci/devices/%s/vfio-dev/", addrStr); + + if (virDirOpen(&dir, sysfsPath) == 1) { + while (virDirRead(dir, &entry, sysfsPath) > 0) { + if (STRPREFIX(entry->d_name, "vfio")) { + *vfioPath = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name); + return 0; + } + } + } + } + + /* Second try: Scan /sys/class/vfio-dev */ + { + g_autofree char *sysfsPath = g_strdup("/sys/class/vfio-dev"); + g_autoptr(DIR) dir = NULL; + struct dirent *entry = NULL; + + if (virDirOpen(&dir, sysfsPath) == 1) { + while (virDirRead(dir, &entry, sysfsPath) > 0) { + g_autofree char *devLink = NULL; + g_autofree char *target = NULL; + + if (!STRPREFIX(entry->d_name, "vfio")) + continue; + + devLink = g_strdup_printf("/sys/class/vfio-dev/%s/device", entry->d_name); + + if (virFileResolveLink(devLink, &target) < 0) + continue; + + if (strstr(target, addrStr)) { + *vfioPath = g_strdup_printf("/dev/vfio/devices/%s", entry->d_name); + return 0; + } + } + } + } + + virReportError(VIR_ERR_INTERNAL_ERROR, + _("cannot find VFIO device for PCI device %1$s"), + addrStr); + return -1; +} diff --git a/src/util/virpci.h b/src/util/virpci.h index fc538566e1..24ede10755 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -296,6 +296,8 @@ void virPCIEDeviceInfoFree(virPCIEDeviceInfo *dev); void virPCIDeviceAddressFree(virPCIDeviceAddress *address); +int virPCIDeviceGetVfioPath(virPCIDeviceAddress *addr, char **vfioPath); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(virPCIDevice, virPCIDeviceFree); G_DEFINE_AUTOPTR_CLEANUP_FUNC(virPCIDeviceAddress, virPCIDeviceAddressFree); G_DEFINE_AUTOPTR_CLEANUP_FUNC(virPCIEDeviceInfo, virPCIEDeviceInfoFree); -- 2.43.0
On a Friday in 2025, Nathan Chen via Devel wrote:
Open VFIO FDs from libvirt backend without exposing these FDs to XML users, i.e. one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.h | 2 + src/libvirt_private.syms | 1 + src/qemu/qemu_command.c | 26 ++++++++ src/qemu/qemu_domain.c | 39 ++++++++++++ src/qemu/qemu_domain.h | 17 +++++ src/qemu/qemu_process.c | 130 +++++++++++++++++++++++++++++++++++++++ src/util/virpci.c | 69 +++++++++++++++++++++ src/util/virpci.h | 2 + 8 files changed, 286 insertions(+)
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 4fd8342950..da4ce9fc86 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -364,6 +364,8 @@ struct _virDomainHostdevDef { */ virDomainNetDef *parentnet;
+ virObject *privateData; +
The code adding the privateData should be in a separate patch. (done in my gitlab branch)
virDomainHostdevMode mode; virDomainStartupPolicy startupPolicy; bool managed; diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 4e57e4a8f6..ed2b0d381e 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3159,6 +3159,7 @@ virPCIDeviceGetStubDriverName; virPCIDeviceGetStubDriverType; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; +virPCIDeviceGetVfioPath; virPCIDeviceGetVPD; virPCIDeviceHasPCIExpressLink; virPCIDeviceIsAssignable; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 95d1c2ee98..9b08f66175 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4756,6 +4756,12 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + pcisrc->driver.iommufd == VIR_TRISTATE_BOOL_YES) { + useIommufd = true;
No need for this separate bool, just move the two conditions where the bool is used.
+ }
/* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4802,6 +4808,17 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL;
+ if (useIommufd && dev->privateData) {
privateData should be allocated unconditionally for all hostdevs, once that is done, it does not need to be checked here.
+ qemuDomainHostdevPrivate *hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(dev); + + if (hostdevPriv->vfioDeviceFd >= 0) {
For 'fd's, a != -1 comparison is preferred (any other negative value is a programming error)
+ if (virJSONValueObjectAdd(&props, + "S:fd", g_strdup_printf("%d", hostdevPriv->vfioDeviceFd),
g_strdup_printf returns an allocated string, but virJSONValueObjectAdd does not "take ownership" of it. Try: g_autofree char *fdstr = g_strdup(..); at the start of the 'if' block and use 'fdstr' instead'
+ NULL) < 0) + return NULL; + } + } + if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0) return NULL;
@@ -5260,6 +5277,15 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1;
+ if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { + qemuDomainHostdevPrivate *hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + if (hostdevPriv && hostdevPriv->vfioDeviceFd >= 0) {
No need to check for hostdevPriv if we allocate it for all hostdevs.
+ virCommandPassFD(cmd, hostdevPriv->vfioDeviceFd, + VIR_COMMAND_PASS_FD_CLOSE_PARENT); + } + } + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev))) return -1;
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index ac56fc7cb4..7601bdbb2b 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1238,6 +1238,45 @@ qemuDomainNetworkPrivateFormat(const virDomainNetDef *net, }
+static virClass *qemuDomainHostdevPrivateClass; + +static void +qemuDomainHostdevPrivateDispose(void *obj) +{ + qemuDomainHostdevPrivate *priv = obj; + + VIR_FORCE_CLOSE(priv->vfioDeviceFd); +} + + +static int +qemuDomainHostdevPrivateOnceInit(void) +{ + if (!VIR_CLASS_NEW(qemuDomainHostdevPrivate, virClassForObject())) + return -1; + + return 0; +} + +VIR_ONCE_GLOBAL_INIT(qemuDomainHostdevPrivate); + +virObject * +qemuDomainHostdevPrivateNew(void)
This functions should also be added to virQEMUDriverPrivateDataCallbacks as .hostdevNew, then it can be called from the parser: virDomainHostdevDefNew(virDomainXMLOption *xmlopt) { ... if (xmlopt && xmlopt->privateData.hostdevNew && !(def->privateData = xmlopt->privateData.hostdevNew())) { VIR_FREE(def->info); VIR_FREE(def); return NULL; } ... } (Yes, the callers would need to be adjusted too)
+{ + qemuDomainHostdevPrivate *priv; + + if (qemuDomainHostdevPrivateInitialize() < 0) ... diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 45fc32a663..bf245ee8af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -106,6 +106,7 @@
#include "logging/log_manager.h" #include "logging/log_protocol.h" +#include "util/virpci.h"
No need for the util prefix. Also, can be grouped in the previous group.
#define VIR_FROM_THIS VIR_FROM_QEMU
@@ -8091,6 +8092,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuExtDevicesStart(driver, vm, incomingMigrationExtDevices) < 0) goto cleanup;
+ if (qemuProcessOpenVfioFds(vm) < 0) + goto cleanup; + if (!(cmd = qemuBuildCommandLine(vm, incoming ? "defer" : NULL, vmop, @@ -10267,3 +10271,129 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, qemuProcessEventSubmit(vm, QEMU_PROCESS_EVENT_NBDKIT_EXITED, 0, 0, nbdkit); virObjectUnlock(vm); } + +/** + * qemuProcessOpenVfioDeviceFd: + * @hostdev: host device definition + * @vfioFd: returned file descriptor + * + * Opens the VFIO device file descriptor for a hostdev. + * + * Returns: 0 on success, -1 on failure
This can return the fd on success directly, no need to return it via pointer.
+ */ +static int +qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev, + int *vfioFd) +{ + g_autofree char *vfioPath = NULL; + int fd = -1; + + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || + hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("VFIO FD only supported for PCI hostdevs")); + return -1; + } + + if (virPCIDeviceGetVfioPath(&hostdev->source.subsys.u.pci.addr, &vfioPath) < 0) + return -1; + + VIR_DEBUG("Opening VFIO device %s", vfioPath); + + if ((fd = open(vfioPath, O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("VFIO device %1$s not found - ensure device is bound to vfio-pci driver"), + vfioPath); + } else { + virReportSystemError(errno, + _("cannot open VFIO device %1$s"), vfioPath); + } + return -1; + } + + *vfioFd = fd; + VIR_DEBUG("Opened VFIO device FD %d for %s", *vfioFd, vfioPath); + return 0; +} + +/** + * qemuProcessOpenVfioFds: + * @vm: domain object + * + * Opens all necessary VFIO file descriptors for the domain. + * + * Returns: 0 on success, -1 on failure + */ +int +qemuProcessOpenVfioFds(virDomainObj *vm) +{ + size_t i; + + /* Check if we have any hostdevs that need VFIO FDs */ + for (i = 0; i < vm->def->nhostdevs; i++) { + virDomainHostdevDef *hostdev = vm->def->hostdevs[i]; + qemuDomainHostdevPrivate *hostdevPriv = NULL; +
+ if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + + if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + hostdev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { +
These two conditions above can be joint into one to save one level of indentation
+ if (!hostdev->privateData) { + if (!(hostdev->privateData = qemuDomainHostdevPrivateNew())) + goto error; + } +
This should be allocated in virDomainHostdevDefNew.
+ hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + /* Open VFIO device FD */ + if (qemuProcessOpenVfioDeviceFd(hostdev, &hostdevPriv->vfioDeviceFd) < 0) + goto error; +
+ VIR_DEBUG("Stored VFIO FD %d in hostdev %04x:%02x:%02x.%d private data", + hostdevPriv->vfioDeviceFd, + hostdev->source.subsys.u.pci.addr.domain, + hostdev->source.subsys.u.pci.addr.bus, + hostdev->source.subsys.u.pci.addr.slot, + hostdev->source.subsys.u.pci.addr.function);
No need for this debug log since the function above just logged the same info
+ } + } + } + + return 0; + + error: + qemuProcessCloseVfioFds(vm);
No need to close the file descriptors here, they will be closed when the private data gets destroyed along with the rest of the live domain definition (i.e. the copy libvirt makes on domain startup)
+ return -1; +} + +/** + * qemuProcessCloseVfioFds: + * @vm: domain object + * + * Closes all VFIO file descriptors for the domain. + */ +void +qemuProcessCloseVfioFds(virDomainObj *vm)
This function is not necessary
+{ + size_t i; + + /* Close all VFIO device FDs */ + for (i = 0; i < vm->def->nhostdevs; i++) { + virDomainHostdevDef *hostdev = vm->def->hostdevs[i]; + qemuDomainHostdevPrivate *hostdevPriv; + + if (!hostdev->privateData) + continue; + + hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + if (hostdevPriv->vfioDeviceFd >= 0) { + VIR_DEBUG("Closing VFIO device FD %d", hostdevPriv->vfioDeviceFd); + VIR_FORCE_CLOSE(hostdevPriv->vfioDeviceFd); + } + } +}
Jano
On 12/8/2025 10:16 AM, Ján Tomko wrote:
On a Friday in 2025, Nathan Chen via Devel wrote:
Open VFIO FDs from libvirt backend without exposing these FDs to XML users, i.e. one per iommufd hostdev for /dev/vfio/devices/vfioX, and pass the FD to qemu command line.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.h | 2 + src/libvirt_private.syms | 1 + src/qemu/qemu_command.c | 26 ++++++++ src/qemu/qemu_domain.c | 39 ++++++++++++ src/qemu/qemu_domain.h | 17 +++++ src/qemu/qemu_process.c | 130 +++++++++++++++++++++++++++++++++++++++ src/util/virpci.c | 69 +++++++++++++++++++++ src/util/virpci.h | 2 + 8 files changed, 286 insertions(+)
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 4fd8342950..da4ce9fc86 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -364,6 +364,8 @@ struct _virDomainHostdevDef { */ virDomainNetDef *parentnet;
+ virObject *privateData; +
The code adding the privateData should be in a separate patch. (done in my gitlab branch)
Ok, I will separate it out for the next revision.
virDomainHostdevMode mode; virDomainStartupPolicy startupPolicy; bool managed; diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 4e57e4a8f6..ed2b0d381e 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3159,6 +3159,7 @@ virPCIDeviceGetStubDriverName; virPCIDeviceGetStubDriverType; virPCIDeviceGetUnbindFromStub; virPCIDeviceGetUsedBy; +virPCIDeviceGetVfioPath; virPCIDeviceGetVPD; virPCIDeviceHasPCIExpressLink; virPCIDeviceIsAssignable; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 95d1c2ee98..9b08f66175 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -4756,6 +4756,12 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, const char *iommufdId = NULL; /* 'ramfb' property must be omitted unless it's to be enabled */ bool ramfb = pcisrc->ramfb == VIR_TRISTATE_SWITCH_ON; + bool useIommufd = false; + + if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + pcisrc->driver.iommufd == VIR_TRISTATE_BOOL_YES) { + useIommufd = true;
No need for this separate bool, just move the two conditions where the bool is used.
Ok, I will replace this separate bool.
+ }
/* caller has to assign proper passthrough driver name */ switch (pcisrc->driver.name) { @@ -4802,6 +4808,17 @@ qemuBuildPCIHostdevDevProps(const virDomainDef *def, NULL) < 0) return NULL;
+ if (useIommufd && dev->privateData) {
privateData should be allocated unconditionally for all hostdevs, once that is done, it does not need to be checked here.
Ok, that makes sense.
+ qemuDomainHostdevPrivate *hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(dev); + + if (hostdevPriv->vfioDeviceFd >= 0) {
For 'fd's, a != -1 comparison is preferred (any other negative value is a programming error)
Ok, I will change this to compare with -1.
+ if (virJSONValueObjectAdd(&props, + "S:fd", g_strdup_printf("%d", hostdevPriv->vfioDeviceFd),
g_strdup_printf returns an allocated string, but virJSONValueObjectAdd does not "take ownership" of it.
Try: g_autofree char *fdstr = g_strdup(..); at the start of the 'if' block and use 'fdstr' instead'
I will look into this for the next revision.
+ NULL) < 0) + return NULL; + } + } + if (qemuBuildDeviceAddressProps(props, def, dev->info) < 0) return NULL;
@@ -5260,6 +5277,15 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (qemuCommandAddExtDevice(cmd, hostdev->info, def, qemuCaps) < 0) return -1;
+ if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { + qemuDomainHostdevPrivate *hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + if (hostdevPriv && hostdevPriv->vfioDeviceFd >= 0) {
No need to check for hostdevPriv if we allocate it for all hostdevs.
+ virCommandPassFD(cmd, hostdevPriv->vfioDeviceFd, + VIR_COMMAND_PASS_FD_CLOSE_PARENT); + } + } + if (!(devprops = qemuBuildPCIHostdevDevProps(def, hostdev))) return -1;
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index ac56fc7cb4..7601bdbb2b 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1238,6 +1238,45 @@ qemuDomainNetworkPrivateFormat(const virDomainNetDef *net, }
+static virClass *qemuDomainHostdevPrivateClass; + +static void +qemuDomainHostdevPrivateDispose(void *obj) +{ + qemuDomainHostdevPrivate *priv = obj; + + VIR_FORCE_CLOSE(priv->vfioDeviceFd); +} + + +static int +qemuDomainHostdevPrivateOnceInit(void) +{ + if (!VIR_CLASS_NEW(qemuDomainHostdevPrivate, virClassForObject())) + return -1; + + return 0; +} + +VIR_ONCE_GLOBAL_INIT(qemuDomainHostdevPrivate); + +virObject * +qemuDomainHostdevPrivateNew(void)
This functions should also be added to virQEMUDriverPrivateDataCallbacks as .hostdevNew, then it can be called from the parser:
virDomainHostdevDefNew(virDomainXMLOption *xmlopt) { ... if (xmlopt && xmlopt->privateData.hostdevNew && !(def->privateData = xmlopt->privateData.hostdevNew())) { VIR_FREE(def->info); VIR_FREE(def); return NULL; } ... }
(Yes, the callers would need to be adjusted too)
Ok, I will set this up in the next revision.
+{ + qemuDomainHostdevPrivate *priv; + + if (qemuDomainHostdevPrivateInitialize() < 0) ... diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 45fc32a663..bf245ee8af 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -106,6 +106,7 @@
#include "logging/log_manager.h" #include "logging/log_protocol.h" +#include "util/virpci.h"
No need for the util prefix. Also, can be grouped in the previous group.
Ok, I will fix this.
#define VIR_FROM_THIS VIR_FROM_QEMU
@@ -8091,6 +8092,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuExtDevicesStart(driver, vm, incomingMigrationExtDevices) < 0) goto cleanup;
+ if (qemuProcessOpenVfioFds(vm) < 0) + goto cleanup; + if (!(cmd = qemuBuildCommandLine(vm, incoming ? "defer" : NULL, vmop, @@ -10267,3 +10271,129 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, qemuProcessEventSubmit(vm, QEMU_PROCESS_EVENT_NBDKIT_EXITED, 0, 0, nbdkit); virObjectUnlock(vm); } + +/** + * qemuProcessOpenVfioDeviceFd: + * @hostdev: host device definition + * @vfioFd: returned file descriptor + * + * Opens the VFIO device file descriptor for a hostdev. + * + * Returns: 0 on success, -1 on failure
This can return the fd on success directly, no need to return it via pointer.
Got it, makes sense.
+ */ +static int +qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev, + int *vfioFd) +{ + g_autofree char *vfioPath = NULL; + int fd = -1; + + + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || + hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("VFIO FD only supported for PCI hostdevs")); + return -1; + } + + if (virPCIDeviceGetVfioPath(&hostdev->source.subsys.u.pci.addr, &vfioPath) < 0) + return -1; + + VIR_DEBUG("Opening VFIO device %s", vfioPath); + + if ((fd = open(vfioPath, O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("VFIO device %1$s not found - ensure device is bound to vfio-pci driver"), + vfioPath); + } else { + virReportSystemError(errno, + _("cannot open VFIO device %1$s"), vfioPath); + } + return -1; + } + + *vfioFd = fd; + VIR_DEBUG("Opened VFIO device FD %d for %s", *vfioFd, vfioPath); + return 0; +} + +/** + * qemuProcessOpenVfioFds: + * @vm: domain object + * + * Opens all necessary VFIO file descriptors for the domain. + * + * Returns: 0 on success, -1 on failure + */ +int +qemuProcessOpenVfioFds(virDomainObj *vm) +{ + size_t i; + + /* Check if we have any hostdevs that need VFIO FDs */ + for (i = 0; i < vm->def->nhostdevs; i++) { + virDomainHostdevDef *hostdev = vm->def->hostdevs[i]; + qemuDomainHostdevPrivate *hostdevPriv = NULL; +
+ if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + + if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + hostdev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { +
These two conditions above can be joint into one to save one level of indentation
Ok, I will put these conditions into one check.
+ if (!hostdev->privateData) { + if (!(hostdev->privateData = qemuDomainHostdevPrivateNew())) + goto error; + } +
This should be allocated in virDomainHostdevDefNew.
Got it, I will allocate this unconditionally for all hostdevs.
+ hostdevPriv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + + /* Open VFIO device FD */ + if (qemuProcessOpenVfioDeviceFd(hostdev, &hostdevPriv->vfioDeviceFd) < 0) + goto error; +
+ VIR_DEBUG("Stored VFIO FD %d in hostdev %04x:%02x: %02x.%d private data", + hostdevPriv->vfioDeviceFd, + hostdev->source.subsys.u.pci.addr.domain, + hostdev->source.subsys.u.pci.addr.bus, + hostdev->source.subsys.u.pci.addr.slot, + hostdev->source.subsys.u.pci.addr.function);
No need for this debug log since the function above just logged the same info
Ok, I will remove it.
+ } + } + } + + return 0; + + error: + qemuProcessCloseVfioFds(vm);
No need to close the file descriptors here, they will be closed when the private data gets destroyed along with the rest of the live domain definition (i.e. the copy libvirt makes on domain startup)
+ return -1; +} + +/** + * qemuProcessCloseVfioFds: + * @vm: domain object + * + * Closes all VFIO file descriptors for the domain. + */ +void +qemuProcessCloseVfioFds(virDomainObj *vm)
This function is not necessary Ok, I will remove this function and references to it.
Thanks, Nathan
Open iommufd FD from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu, and pass the FD to qemu command line. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 8 ++++-- src/qemu/qemu_domain.c | 1 + src/qemu/qemu_domain.h | 2 ++ src/qemu/qemu_process.c | 56 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 65 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 9b08f66175..99c310cf31 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -5229,12 +5229,14 @@ qemuBuildAcpiNodesetProps(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; const char * iommufdId = "iommufd0"; + qemuDomainObjPrivate *priv = vm->privateData; for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5265,8 +5267,10 @@ qemuBuildHostdevCommandLine(virCommand *cmd, if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); if (qemuMonitorCreateObjectProps(&props, "iommufd", iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd), NULL) < 0) return -1; @@ -10967,7 +10971,7 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildRedirdevCommandLine(cmd, def, qemuCaps) < 0) return NULL; - if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) + if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps, vm) < 0) return NULL; if (migrateURI) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 7601bdbb2b..d569dd5ad9 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2042,6 +2042,7 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->blockjobs = virHashNew(virObjectUnref); priv->fds = virHashNew(g_object_unref); + priv->iommufd = -1; priv->pidMonitored = -1; /* agent commands block by default, user can choose different behavior */ diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 4736f1ede5..e55ba1c968 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -264,6 +264,8 @@ struct _qemuDomainObjPrivate { /* named file descriptor groups associated with the VM */ GHashTable *fds; + int iommufd; + char *memoryBackingDir; }; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index bf245ee8af..83b8a586a1 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -10272,6 +10272,38 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, virObjectUnlock(vm); } +/** + * qemuProcessOpenIommuFd: + * @vm: domain object + * @iommuFd: returned file descriptor + * + * Opens /dev/iommu file descriptor for the VM. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenIommuFd(virDomainObj *vm, int *iommuFd) +{ + int fd = -1; + + VIR_DEBUG("Opening IOMMU FD for domain %s", vm->def->name); + + if ((fd = open("/dev/iommu", O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU FD support requires /dev/iommu device")); + } else { + virReportSystemError(errno, "%s", + _("cannot open /dev/iommu")); + } + return -1; + } + + *iommuFd = fd; + VIR_DEBUG("Opened IOMMU FD %d for domain %s", fd, vm->def->name); + return 0; +} + /** * qemuProcessOpenVfioDeviceFd: * @hostdev: host device definition @@ -10329,6 +10361,8 @@ qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev, int qemuProcessOpenVfioFds(virDomainObj *vm) { + qemuDomainObjPrivate *priv = vm->privateData; + bool needsIommuFd = false; size_t i; /* Check if we have any hostdevs that need VFIO FDs */ @@ -10342,6 +10376,8 @@ qemuProcessOpenVfioFds(virDomainObj *vm) if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && hostdev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { + needsIommuFd = true; + if (!hostdev->privateData) { if (!(hostdev->privateData = qemuDomainHostdevPrivateNew())) goto error; @@ -10363,6 +10399,18 @@ qemuProcessOpenVfioFds(virDomainObj *vm) } } + /* Open IOMMU FD if needed */ + if (needsIommuFd) { + int iommuFd = -1; + + if (qemuProcessOpenIommuFd(vm, &iommuFd) < 0) + goto error; + + priv->iommufd = iommuFd; + + VIR_DEBUG("Stored IOMMU FD %d", priv->iommufd); + } + return 0; error: @@ -10379,6 +10427,7 @@ qemuProcessOpenVfioFds(virDomainObj *vm) void qemuProcessCloseVfioFds(virDomainObj *vm) { + qemuDomainObjPrivate *priv = vm->privateData; size_t i; /* Close all VFIO device FDs */ @@ -10396,4 +10445,11 @@ qemuProcessCloseVfioFds(virDomainObj *vm) VIR_FORCE_CLOSE(hostdevPriv->vfioDeviceFd); } } + + /* Close IOMMU FD */ + if (priv->iommufd >= 0) { + VIR_DEBUG("Closing IOMMU FD %d", priv->iommufd); + VIR_FORCE_CLOSE(priv->iommufd); + priv->iommufd = -1; + } } -- 2.43.0
On a Friday in 2025, Nathan Chen via Devel wrote:
Open iommufd FD from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu, and pass the FD to qemu command line.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 8 ++++-- src/qemu/qemu_domain.c | 1 + src/qemu/qemu_domain.h | 2 ++ src/qemu/qemu_process.c | 56 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 9b08f66175..99c310cf31 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -5229,12 +5229,14 @@ qemuBuildAcpiNodesetProps(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; const char * iommufdId = "iommufd0"; + qemuDomainObjPrivate *priv = vm->privateData;
for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5265,8 +5267,10 @@ qemuBuildHostdevCommandLine(virCommand *cmd,
if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); if (qemuMonitorCreateObjectProps(&props, "iommufd", iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd),
Same comment about g_strdup_printf as with the other fd: formatting. Also, this change would be in a different function now.
NULL) < 0) return -1;
@@ -10967,7 +10971,7 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildRedirdevCommandLine(cmd, def, qemuCaps) < 0) return NULL;
- if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) + if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps, vm) < 0) return NULL;
if (migrateURI) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 7601bdbb2b..d569dd5ad9 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2042,6 +2042,7 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->blockjobs = virHashNew(virObjectUnref); priv->fds = virHashNew(g_object_unref);
+ priv->iommufd = -1; priv->pidMonitored = -1;
/* agent commands block by default, user can choose different behavior */ diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 4736f1ede5..e55ba1c968 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -264,6 +264,8 @@ struct _qemuDomainObjPrivate { /* named file descriptor groups associated with the VM */ GHashTable *fds;
+ int iommufd; + char *memoryBackingDir; };
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index bf245ee8af..83b8a586a1 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -10272,6 +10272,38 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, virObjectUnlock(vm); }
+/** + * qemuProcessOpenIommuFd: + * @vm: domain object + * @iommuFd: returned file descriptor + * + * Opens /dev/iommu file descriptor for the VM. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenIommuFd(virDomainObj *vm, int *iommuFd) +{ + int fd = -1; + + VIR_DEBUG("Opening IOMMU FD for domain %s", vm->def->name); + + if ((fd = open("/dev/iommu", O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU FD support requires /dev/iommu device")); + } else { + virReportSystemError(errno, "%s", + _("cannot open /dev/iommu")); + } + return -1; + } + + *iommuFd = fd; + VIR_DEBUG("Opened IOMMU FD %d for domain %s", fd, vm->def->name); + return 0; +} + /** * qemuProcessOpenVfioDeviceFd: * @hostdev: host device definition @@ -10329,6 +10361,8 @@ qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev, int qemuProcessOpenVfioFds(virDomainObj *vm) { + qemuDomainObjPrivate *priv = vm->privateData; + bool needsIommuFd = false; size_t i;
/* Check if we have any hostdevs that need VFIO FDs */ @@ -10342,6 +10376,8 @@ qemuProcessOpenVfioFds(virDomainObj *vm) if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && hostdev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) {
+ needsIommuFd = true; + if (!hostdev->privateData) { if (!(hostdev->privateData = qemuDomainHostdevPrivateNew())) goto error; @@ -10363,6 +10399,18 @@ qemuProcessOpenVfioFds(virDomainObj *vm) } }
+ /* Open IOMMU FD if needed */ + if (needsIommuFd) { + int iommuFd = -1; +
No need for the intermediary variable,
+ if (qemuProcessOpenIommuFd(vm, &iommuFd) < 0) + goto error; + + priv->iommufd = iommuFd; + + VIR_DEBUG("Stored IOMMU FD %d", priv->iommufd); + } + return 0;
error:
Everything below can be dropped Jano
@@ -10379,6 +10427,7 @@ qemuProcessOpenVfioFds(virDomainObj *vm) void qemuProcessCloseVfioFds(virDomainObj *vm) { + qemuDomainObjPrivate *priv = vm->privateData; size_t i;
/* Close all VFIO device FDs */ @@ -10396,4 +10445,11 @@ qemuProcessCloseVfioFds(virDomainObj *vm) VIR_FORCE_CLOSE(hostdevPriv->vfioDeviceFd); } } + + /* Close IOMMU FD */ + if (priv->iommufd >= 0) { + VIR_DEBUG("Closing IOMMU FD %d", priv->iommufd); + VIR_FORCE_CLOSE(priv->iommufd); + priv->iommufd = -1; + } } -- 2.43.0
On 12/8/2025 10:19 AM, Ján Tomko wrote:
On a Friday in 2025, Nathan Chen via Devel wrote:
Open iommufd FD from libvirt backend without exposing these FDs to XML users, i.e. one per domain for /dev/iommu, and pass the FD to qemu command line.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_command.c | 8 ++++-- src/qemu/qemu_domain.c | 1 + src/qemu/qemu_domain.h | 2 ++ src/qemu/qemu_process.c | 56 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 9b08f66175..99c310cf31 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -5229,12 +5229,14 @@ qemuBuildAcpiNodesetProps(virCommand *cmd, static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, - virQEMUCaps *qemuCaps) + virQEMUCaps *qemuCaps, + virDomainObj *vm) { size_t i; g_autoptr(virJSONValue) props = NULL; int iommufd = 0; const char * iommufdId = "iommufd0"; + qemuDomainObjPrivate *priv = vm->privateData;
for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDef *hostdev = def->hostdevs[i]; @@ -5265,8 +5267,10 @@ qemuBuildHostdevCommandLine(virCommand *cmd,
if (subsys->u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES && iommufd == 0) { iommufd = 1; + virCommandPassFD(cmd, priv->iommufd, VIR_COMMAND_PASS_FD_CLOSE_PARENT); if (qemuMonitorCreateObjectProps(&props, "iommufd", iommufdId, + "S:fd", g_strdup_printf("%d", priv->iommufd),
Same comment about g_strdup_printf as with the other fd: formatting. Also, this change would be in a different function now.
Ok, I will fix this and separate it out from this function.
NULL) < 0) return -1;
@@ -10967,7 +10971,7 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildRedirdevCommandLine(cmd, def, qemuCaps) < 0) return NULL;
- if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) + if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps, vm) < 0) return NULL;
if (migrateURI) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 7601bdbb2b..d569dd5ad9 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2042,6 +2042,7 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->blockjobs = virHashNew(virObjectUnref); priv->fds = virHashNew(g_object_unref);
+ priv->iommufd = -1; priv->pidMonitored = -1;
/* agent commands block by default, user can choose different behavior */ diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 4736f1ede5..e55ba1c968 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -264,6 +264,8 @@ struct _qemuDomainObjPrivate { /* named file descriptor groups associated with the VM */ GHashTable *fds;
+ int iommufd; + char *memoryBackingDir; };
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index bf245ee8af..83b8a586a1 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -10272,6 +10272,38 @@ qemuProcessHandleNbdkitExit(qemuNbdkitProcess *nbdkit, virObjectUnlock(vm); }
+/** + * qemuProcessOpenIommuFd: + * @vm: domain object + * @iommuFd: returned file descriptor + * + * Opens /dev/iommu file descriptor for the VM. + * + * Returns: 0 on success, -1 on failure + */ +static int +qemuProcessOpenIommuFd(virDomainObj *vm, int *iommuFd) +{ + int fd = -1; + + VIR_DEBUG("Opening IOMMU FD for domain %s", vm->def->name); + + if ((fd = open("/dev/iommu", O_RDWR | O_CLOEXEC)) < 0) { + if (errno == ENOENT) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("IOMMU FD support requires /dev/iommu device")); + } else { + virReportSystemError(errno, "%s", + _("cannot open /dev/iommu")); + } + return -1; + } + + *iommuFd = fd; + VIR_DEBUG("Opened IOMMU FD %d for domain %s", fd, vm->def->name); + return 0; +} + /** * qemuProcessOpenVfioDeviceFd: * @hostdev: host device definition @@ -10329,6 +10361,8 @@ qemuProcessOpenVfioDeviceFd(virDomainHostdevDef *hostdev, int qemuProcessOpenVfioFds(virDomainObj *vm) { + qemuDomainObjPrivate *priv = vm->privateData; + bool needsIommuFd = false; size_t i;
/* Check if we have any hostdevs that need VFIO FDs */ @@ -10342,6 +10376,8 @@ qemuProcessOpenVfioFds(virDomainObj *vm) if (hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && hostdev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) {
+ needsIommuFd = true; + if (!hostdev->privateData) { if (!(hostdev->privateData = qemuDomainHostdevPrivateNew())) goto error; @@ -10363,6 +10399,18 @@ qemuProcessOpenVfioFds(virDomainObj *vm) } }
+ /* Open IOMMU FD if needed */ + if (needsIommuFd) { + int iommuFd = -1; +
No need for the intermediary variable,
Ok, I will remove it in the next revision.
+ if (qemuProcessOpenIommuFd(vm, &iommuFd) < 0) + goto error; + + priv->iommufd = iommuFd; + + VIR_DEBUG("Stored IOMMU FD %d", priv->iommufd); + } + return 0;
error:
Everything below can be dropped Got it, I will remove qemuProcessCloseVfioFds.
Thanks, Nathan
When launching a qemu VM with the iommufd feature enabled for VFIO hostdevs: - Do not allow access to /dev/vfio/vfio and /dev/vfio/<iommugroup> used by VFIO without iommufd enabled - Allow access to /dev/iommu and /dev/vfio/devices/vfio* Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_cgroup.c | 26 ++++++++++++++------------ src/qemu/qemu_namespace.c | 16 +++++++++------- src/security/security_apparmor.c | 18 +++++++++++------- src/security/security_dac.c | 28 ++++++++++++++++++---------- src/security/security_selinux.c | 28 ++++++++++++++++++---------- src/security/virt-aa-helper.c | 11 +++++++++-- 6 files changed, 79 insertions(+), 48 deletions(-) diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 46a7dc1d8b..b3610b31ca 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -479,21 +479,23 @@ qemuSetupHostdevCgroup(virDomainObj *vm, g_autofree char *path = NULL; int perms; - if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) - return 0; + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) + return 0; - if (qemuDomainGetHostdevPath(dev, &path, &perms) < 0) - return -1; + if (qemuDomainGetHostdevPath(dev, &path, &perms) < 0) + return -1; - if (path && - qemuCgroupAllowDevicePath(vm, path, perms, false) < 0) { - return -1; - } + if (path && + qemuCgroupAllowDevicePath(vm, path, perms, false) < 0) { + return -1; + } - if (virHostdevNeedsVFIO(dev) && - qemuCgroupAllowDevicePath(vm, QEMU_DEV_VFIO, - VIR_CGROUP_DEVICE_RW, false) < 0) { - return -1; + if (virHostdevNeedsVFIO(dev) && + qemuCgroupAllowDevicePath(vm, QEMU_DEV_VFIO, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } } return 0; diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 932777505b..489b13261b 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -343,15 +343,17 @@ qemuDomainSetupHostdev(virDomainObj *vm, { g_autofree char *path = NULL; - if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0) - return -1; + if (hostdev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0) + return -1; - if (path) - *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); + if (path) + *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); - if (virHostdevNeedsVFIO(hostdev) && - (!hotplug || !qemuDomainNeedsVFIO(vm->def))) - *paths = g_slist_prepend(*paths, g_strdup(QEMU_DEV_VFIO)); + if (virHostdevNeedsVFIO(hostdev) && + (!hotplug || !qemuDomainNeedsVFIO(vm->def))) + *paths = g_slist_prepend(*paths, g_strdup(QEMU_DEV_VFIO)); + } return 0; } diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c index 68ac39611f..d66f035e52 100644 --- a/src/security/security_apparmor.c +++ b/src/security/security_apparmor.c @@ -848,14 +848,18 @@ AppArmorSetSecurityHostdevLabel(virSecurityManager *mgr, goto done; if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); - - if (!vfioGroupDev) { - virPCIDeviceFree(pci); - goto done; + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + + if (!vfioGroupDev) { + virPCIDeviceFree(pci); + goto done; + } + ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr); + VIR_FREE(vfioGroupDev); + } else { + ret = 0; } - ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr); - VIR_FREE(vfioGroupDev); } else { ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr); } diff --git a/src/security/security_dac.c b/src/security/security_dac.c index 2f788b872a..93a9268389 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -1282,14 +1282,18 @@ virSecurityDACSetHostdevLabel(virSecurityManager *mgr, return -1; if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); - if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1; - ret = virSecurityDACSetHostdevLabelHelper(vfioGroupDev, - false, - &cbdata); + ret = virSecurityDACSetHostdevLabelHelper(vfioGroupDev, + false, + &cbdata); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecurityDACSetPCILabel, @@ -1443,13 +1447,17 @@ virSecurityDACRestoreHostdevLabel(virSecurityManager *mgr, return -1; if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); - if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1; - ret = virSecurityDACRestoreFileLabelInternal(mgr, NULL, + ret = virSecurityDACRestoreFileLabelInternal(mgr, NULL, vfioGroupDev, false); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecurityDACRestorePCILabel, mgr); } diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index 2f3cc274a5..af6b938641 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -2256,14 +2256,18 @@ virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManager *mgr, return -1; if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); - if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1; - ret = virSecuritySELinuxSetHostdevLabelHelper(vfioGroupDev, - false, - &data); + ret = virSecuritySELinuxSetHostdevLabelHelper(vfioGroupDev, + false, + &data); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxSetPCILabel, &data); } @@ -2491,12 +2495,16 @@ virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManager *mgr, return -1; if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); - if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1; - ret = virSecuritySELinuxRestoreFileLabel(mgr, vfioGroupDev, false, false); + ret = virSecuritySELinuxRestoreFileLabel(mgr, vfioGroupDev, false, false); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxRestorePCILabel, mgr); } diff --git a/src/security/virt-aa-helper.c b/src/security/virt-aa-helper.c index de0a826063..ea05f2c5f7 100644 --- a/src/security/virt-aa-helper.c +++ b/src/security/virt-aa-helper.c @@ -878,7 +878,7 @@ get_files(vahControl * ctl) size_t i; g_autofree char *uuid = NULL; char uuidstr[VIR_UUID_STRING_BUFLEN]; - bool needsVfio = false, needsvhost = false, needsgl = false; + bool needsVfio = false, needsvhost = false, needsgl = false, needsIommufd = false; /* verify uuid is same as what we were given on the command line */ virUUIDFormat(ctl->def->uuid, uuidstr); @@ -1119,6 +1119,9 @@ get_files(vahControl * ctl) needsVfio = true; } + if (dev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) + needsIommufd = true; + if (pci == NULL) continue; @@ -1344,10 +1347,14 @@ get_files(vahControl * ctl) if (needsvhost) virBufferAddLit(&buf, " \"/dev/vhost-net\" rw,\n"); - if (needsVfio) { + if (needsIommufd) { + virBufferAddLit(&buf, " \"/dev/iommu\" rwm,\n"); + virBufferAddLit(&buf, " \"/dev/vfio/devices/vfio[0-9]*\" rwm,\n"); + } else if (needsVfio) { virBufferAddLit(&buf, " \"/dev/vfio/vfio\" rw,\n"); virBufferAddLit(&buf, " \"/dev/vfio/[0-9]*\" rw,\n"); } + if (needsgl) { /* if using gl all sorts of further dri related paths will be needed */ virBufferAddLit(&buf, " # DRI/Mesa/(e)GL config and driver paths\n"); -- 2.43.0
On a Friday in 2025, Nathan Chen via Devel wrote:
When launching a qemu VM with the iommufd feature enabled for VFIO hostdevs: - Do not allow access to /dev/vfio/vfio and /dev/vfio/<iommugroup> used by VFIO without iommufd enabled - Allow access to /dev/iommu and /dev/vfio/devices/vfio*
The commit summary mentions cgroups, namespaces and seclabels, however I can only see this patch not allowing stuff needed for legacy VFIO, but I don't see it allowing the new paths. I see you add the paths to apparmor - presumably you're using a Debian-based distro that doesn't have SELinux. But are cgroups not used either? Possibly namespaces aren't necessary if we're passing the FDs and I'll look into the SELinux stuff. Jano
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_cgroup.c | 26 ++++++++++++++------------ src/qemu/qemu_namespace.c | 16 +++++++++------- src/security/security_apparmor.c | 18 +++++++++++------- src/security/security_dac.c | 28 ++++++++++++++++++---------- src/security/security_selinux.c | 28 ++++++++++++++++++---------- src/security/virt-aa-helper.c | 11 +++++++++-- 6 files changed, 79 insertions(+), 48 deletions(-)
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c index 46a7dc1d8b..b3610b31ca 100644 --- a/src/qemu/qemu_cgroup.c +++ b/src/qemu/qemu_cgroup.c @@ -479,21 +479,23 @@ qemuSetupHostdevCgroup(virDomainObj *vm, g_autofree char *path = NULL; int perms;
- if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) - return 0; + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_DEVICES)) + return 0;
- if (qemuDomainGetHostdevPath(dev, &path, &perms) < 0) - return -1; + if (qemuDomainGetHostdevPath(dev, &path, &perms) < 0) + return -1;
- if (path && - qemuCgroupAllowDevicePath(vm, path, perms, false) < 0) { - return -1; - } + if (path && + qemuCgroupAllowDevicePath(vm, path, perms, false) < 0) { + return -1; + }
- if (virHostdevNeedsVFIO(dev) && - qemuCgroupAllowDevicePath(vm, QEMU_DEV_VFIO, - VIR_CGROUP_DEVICE_RW, false) < 0) { - return -1; + if (virHostdevNeedsVFIO(dev) && + qemuCgroupAllowDevicePath(vm, QEMU_DEV_VFIO, + VIR_CGROUP_DEVICE_RW, false) < 0) { + return -1; + } }
return 0; diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c index 932777505b..489b13261b 100644 --- a/src/qemu/qemu_namespace.c +++ b/src/qemu/qemu_namespace.c @@ -343,15 +343,17 @@ qemuDomainSetupHostdev(virDomainObj *vm, { g_autofree char *path = NULL;
- if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0) - return -1; + if (hostdev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + if (qemuDomainGetHostdevPath(hostdev, &path, NULL) < 0) + return -1;
- if (path) - *paths = g_slist_prepend(*paths, g_steal_pointer(&path)); + if (path) + *paths = g_slist_prepend(*paths, g_steal_pointer(&path));
- if (virHostdevNeedsVFIO(hostdev) && - (!hotplug || !qemuDomainNeedsVFIO(vm->def))) - *paths = g_slist_prepend(*paths, g_strdup(QEMU_DEV_VFIO)); + if (virHostdevNeedsVFIO(hostdev) && + (!hotplug || !qemuDomainNeedsVFIO(vm->def))) + *paths = g_slist_prepend(*paths, g_strdup(QEMU_DEV_VFIO)); + }
return 0; } diff --git a/src/security/security_apparmor.c b/src/security/security_apparmor.c index 68ac39611f..d66f035e52 100644 --- a/src/security/security_apparmor.c +++ b/src/security/security_apparmor.c @@ -848,14 +848,18 @@ AppArmorSetSecurityHostdevLabel(virSecurityManager *mgr, goto done;
if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); - - if (!vfioGroupDev) { - virPCIDeviceFree(pci); - goto done; + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + + if (!vfioGroupDev) { + virPCIDeviceFree(pci); + goto done; + } + ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr); + VIR_FREE(vfioGroupDev); + } else { + ret = 0; } - ret = AppArmorSetSecurityPCILabel(pci, vfioGroupDev, ptr); - VIR_FREE(vfioGroupDev); } else { ret = virPCIDeviceFileIterate(pci, AppArmorSetSecurityPCILabel, ptr); } diff --git a/src/security/security_dac.c b/src/security/security_dac.c index 2f788b872a..93a9268389 100644 --- a/src/security/security_dac.c +++ b/src/security/security_dac.c @@ -1282,14 +1282,18 @@ virSecurityDACSetHostdevLabel(virSecurityManager *mgr, return -1;
if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci);
- if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1;
- ret = virSecurityDACSetHostdevLabelHelper(vfioGroupDev, - false, - &cbdata); + ret = virSecurityDACSetHostdevLabelHelper(vfioGroupDev, + false, + &cbdata); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecurityDACSetPCILabel, @@ -1443,13 +1447,17 @@ virSecurityDACRestoreHostdevLabel(virSecurityManager *mgr, return -1;
if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci);
- if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1;
- ret = virSecurityDACRestoreFileLabelInternal(mgr, NULL, + ret = virSecurityDACRestoreFileLabelInternal(mgr, NULL, vfioGroupDev, false); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecurityDACRestorePCILabel, mgr); } diff --git a/src/security/security_selinux.c b/src/security/security_selinux.c index 2f3cc274a5..af6b938641 100644 --- a/src/security/security_selinux.c +++ b/src/security/security_selinux.c @@ -2256,14 +2256,18 @@ virSecuritySELinuxSetHostdevSubsysLabel(virSecurityManager *mgr, return -1;
if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci);
- if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1;
- ret = virSecuritySELinuxSetHostdevLabelHelper(vfioGroupDev, - false, - &data); + ret = virSecuritySELinuxSetHostdevLabelHelper(vfioGroupDev, + false, + &data); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxSetPCILabel, &data); } @@ -2491,12 +2495,16 @@ virSecuritySELinuxRestoreHostdevSubsysLabel(virSecurityManager *mgr, return -1;
if (pcisrc->driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO) { - g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci); + if (dev->source.subsys.u.pci.driver.iommufd != VIR_TRISTATE_BOOL_YES) { + g_autofree char *vfioGroupDev = virPCIDeviceGetIOMMUGroupDev(pci);
- if (!vfioGroupDev) - return -1; + if (!vfioGroupDev) + return -1;
- ret = virSecuritySELinuxRestoreFileLabel(mgr, vfioGroupDev, false, false); + ret = virSecuritySELinuxRestoreFileLabel(mgr, vfioGroupDev, false, false); + } else { + ret = 0; + } } else { ret = virPCIDeviceFileIterate(pci, virSecuritySELinuxRestorePCILabel, mgr); } diff --git a/src/security/virt-aa-helper.c b/src/security/virt-aa-helper.c index de0a826063..ea05f2c5f7 100644 --- a/src/security/virt-aa-helper.c +++ b/src/security/virt-aa-helper.c @@ -878,7 +878,7 @@ get_files(vahControl * ctl) size_t i; g_autofree char *uuid = NULL; char uuidstr[VIR_UUID_STRING_BUFLEN]; - bool needsVfio = false, needsvhost = false, needsgl = false; + bool needsVfio = false, needsvhost = false, needsgl = false, needsIommufd = false;
/* verify uuid is same as what we were given on the command line */ virUUIDFormat(ctl->def->uuid, uuidstr); @@ -1119,6 +1119,9 @@ get_files(vahControl * ctl) needsVfio = true; }
+ if (dev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) + needsIommufd = true; + if (pci == NULL) continue;
@@ -1344,10 +1347,14 @@ get_files(vahControl * ctl) if (needsvhost) virBufferAddLit(&buf, " \"/dev/vhost-net\" rw,\n");
- if (needsVfio) { + if (needsIommufd) { + virBufferAddLit(&buf, " \"/dev/iommu\" rwm,\n"); + virBufferAddLit(&buf, " \"/dev/vfio/devices/vfio[0-9]*\" rwm,\n"); + } else if (needsVfio) { virBufferAddLit(&buf, " \"/dev/vfio/vfio\" rw,\n"); virBufferAddLit(&buf, " \"/dev/vfio/[0-9]*\" rw,\n"); } + if (needsgl) { /* if using gl all sorts of further dri related paths will be needed */ virBufferAddLit(&buf, " # DRI/Mesa/(e)GL config and driver paths\n"); -- 2.43.0
On 12/8/2025 10:22 AM, Ján Tomko wrote:
On a Friday in 2025, Nathan Chen via Devel wrote:
When launching a qemu VM with the iommufd feature enabled for VFIO hostdevs: - Do not allow access to /dev/vfio/vfio and /dev/vfio/<iommugroup> used by VFIO without iommufd enabled - Allow access to /dev/iommu and /dev/vfio/devices/vfio*
The commit summary mentions cgroups, namespaces and seclabels, however I can only see this patch not allowing stuff needed for legacy VFIO, but I don't see it allowing the new paths.
I see you add the paths to apparmor - presumably you're using a Debian-based distro that doesn't have SELinux.
But are cgroups not used either?
Possibly namespaces aren't necessary if we're passing the FDs and I'll look into the SELinux stuff.
From my testing, allowing the iommufd paths in cgroups and namespaces is no longer necessary since we are passing the FDs. I will update the commit description to be more specific, i.e. - Do not allow cgroup, namespace, and seclabel access to VFIO paths (/dev/vfio/vfio and /dev/vfio/<iommugroup>) if iommufd is enabled - Allow access to iommufd paths (/dev/iommu and /dev/vfio/devices/vfio*) in AppArmor and SELinux if iommufd is enabled Thanks, Nathan
Provide sample XML and CLI args for the iommufd XML schema for pc, q35, and virt machine types. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- .../iommufd-q35.x86_64-latest.args | 41 +++++++++++++ .../iommufd-q35.x86_64-latest.xml | 60 +++++++++++++++++++ tests/qemuxmlconfdata/iommufd-q35.xml | 38 ++++++++++++ .../iommufd-virt.aarch64-latest.args | 33 ++++++++++ .../iommufd-virt.aarch64-latest.xml | 34 +++++++++++ tests/qemuxmlconfdata/iommufd-virt.xml | 22 +++++++ .../iommufd.x86_64-latest.args | 35 +++++++++++ .../qemuxmlconfdata/iommufd.x86_64-latest.xml | 38 ++++++++++++ tests/qemuxmlconfdata/iommufd.xml | 30 ++++++++++ tests/qemuxmlconftest.c | 33 ++++++++++ 10 files changed, 364 insertions(+) create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd-q35.xml create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd-virt.xml create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.args create mode 100644 tests/qemuxmlconfdata/iommufd.x86_64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommufd.xml diff --git a/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args new file mode 100644 index 0000000000..7d819e141b --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.args @@ -0,0 +1,41 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-q35-test \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-q35-test/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-q35-test/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-q35-test/.config \ +/usr/bin/qemu-system-x86_64 \ +-name guest=q35-test,debug-threads=on \ +-S \ +-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-q35-test/master-key.aes"}' \ +-machine q35,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=off \ +-accel tcg \ +-cpu qemu64 \ +-m size=2097152k \ +-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":2147483648}' \ +-overcommit mem-lock=off \ +-smp 2,sockets=2,cores=1,threads=1 \ +-uuid 11dbdcdd-4c3b-482b-8903-9bdb8c0a2774 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-device '{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \ +-device '{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}' \ +-device '{"driver":"qemu-xhci","id":"usb","bus":"pci.1","addr":"0x0"}' \ +-blockdev '{"driver":"host_device","filename":"/dev/HostVG/QEMUGuest1","node-name":"libvirt-1-storage","read-only":false}' \ +-device '{"driver":"ide-hd","bus":"ide.0","drive":"libvirt-1-storage","id":"sata0-0-0","bootindex":1}' \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-device '{"driver":"qxl-vga","id":"video0","max_outputs":1,"ram_size":67108864,"vram_size":33554432,"vram64_size_mb":0,"vgamem_mb":8,"bus":"pcie.0","addr":"0x1"}' \ +-global ICH9-LPC.noreboot=off \ +-watchdog-action reset \ +-object '{"qom-type":"iommufd","id":"iommufd0","fd":"-1"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.5","id":"hostdev0","iommufd":"iommufd0","fd":"0","bus":"pcie.0","addr":"0x3"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml new file mode 100644 index 0000000000..bb76252b61 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-q35.x86_64-latest.xml @@ -0,0 +1,60 @@ +<domain type='qemu'> + <name>q35-test</name> + <uuid>11dbdcdd-4c3b-482b-8903-9bdb8c0a2774</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='q35'>hvm</type> + <boot dev='hd'/> + </os> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>qemu64</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <disk type='block' device='disk'> + <driver name='qemu' type='raw'/> + <source dev='/dev/HostVG/QEMUGuest1'/> + <target dev='sda' bus='sata'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='pci' index='0' model='pcie-root'/> + <controller type='pci' index='1' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='1' port='0x10'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/> + </controller> + <controller type='pci' index='2' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='2' port='0x11'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/> + </controller> + <controller type='sata' index='0'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/> + </controller> + <controller type='usb' index='0' model='qemu-xhci'> + <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <audio id='1' type='none'/> + <video> + <model type='qxl' ram='65536' vram='32768' vgamem='8192' heads='1' primary='yes'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> + </video> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <watchdog model='itco' action='reset'/> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd-q35.xml b/tests/qemuxmlconfdata/iommufd-q35.xml new file mode 100644 index 0000000000..f3c2269fb1 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-q35.xml @@ -0,0 +1,38 @@ +<domain type='qemu'> + <name>q35-test</name> + <uuid>11dbdcdd-4c3b-482b-8903-9bdb8c0a2774</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='q35'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <disk type='block' device='disk'> + <source dev='/dev/HostVG/QEMUGuest1'/> + <target dev='sda' bus='sata'/> + <address type='drive' controller='0' bus='0' target='0' unit='0'/> + </disk> + <controller type='pci' index='0' model='pcie-root'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <controller type='sata' index='0'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <video> + <model type='qxl' ram='65536' vram='32768' vgamem='8192' heads='1'/> + </video> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args new file mode 100644 index 0000000000..dbfd395168 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.args @@ -0,0 +1,33 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-foo \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-foo/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-foo/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-foo/.config \ +/usr/bin/qemu-system-aarch64 \ +-name guest=foo,debug-threads=on \ +-S \ +-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-foo/master-key.aes"}' \ +-machine virt,usb=off,gic-version=2,dump-guest-core=off,memory-backend=mach-virt.ram,acpi=off \ +-accel tcg \ +-cpu cortex-a15 \ +-m size=1048576k \ +-object '{"qom-type":"memory-backend-ram","id":"mach-virt.ram","size":1073741824}' \ +-overcommit mem-lock=off \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid 6ba7b810-9dad-11d1-80b4-00c04fd430c8 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-object '{"qom-type":"iommufd","id":"iommufd0","fd":"-1"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.5","id":"hostdev0","iommufd":"iommufd0","fd":"0","bus":"pcie.0","addr":"0x1"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml new file mode 100644 index 0000000000..97b6e1e1c7 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-virt.aarch64-latest.xml @@ -0,0 +1,34 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>6ba7b810-9dad-11d1-80b4-00c04fd430c8</uuid> + <memory unit='KiB'>1048576</memory> + <currentMemory unit='KiB'>1048576</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='aarch64' machine='virt'>hvm</type> + <boot dev='hd'/> + </os> + <features> + <gic version='2'/> + </features> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>cortex-a15</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-aarch64</emulator> + <controller type='pci' index='0' model='pcie-root'/> + <audio id='1' type='none'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> + </hostdev> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd-virt.xml b/tests/qemuxmlconfdata/iommufd-virt.xml new file mode 100644 index 0000000000..c0b9d643b4 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd-virt.xml @@ -0,0 +1,22 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>6ba7b810-9dad-11d1-80b4-00c04fd430c8</uuid> + <memory unit='KiB'>1048576</memory> + <currentMemory unit='KiB'>1048576</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='aarch64' machine='virt'>hvm</type> + </os> + <devices> + <emulator>/usr/bin/qemu-system-aarch64</emulator> + <controller type='pci' index='0' model='pcie-root'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> + </hostdev> + <memballoon model='none'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd.x86_64-latest.args b/tests/qemuxmlconfdata/iommufd.x86_64-latest.args new file mode 100644 index 0000000000..3130ba2e3a --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd.x86_64-latest.args @@ -0,0 +1,35 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-foo \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-foo/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-foo/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-foo/.config \ +/usr/bin/qemu-system-x86_64 \ +-name guest=foo,debug-threads=on \ +-S \ +-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-foo/master-key.aes"}' \ +-machine pc,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=off \ +-accel tcg \ +-cpu qemu64 \ +-m size=2097152k \ +-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":2147483648}' \ +-overcommit mem-lock=off \ +-smp 2,sockets=2,cores=1,threads=1 \ +-uuid 3c7c30b5-7866-4b05-8a29-efebccba52a0 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-device '{"driver":"piix3-usb-uhci","id":"usb","bus":"pci.0","addr":"0x1.0x2"}' \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-object '{"qom-type":"iommufd","id":"iommufd0","fd":"-1"}' \ +-device '{"driver":"vfio-pci","host":"0000:06:12.5","id":"hostdev0","iommufd":"iommufd0","fd":"0","bus":"pci.0","addr":"0x3"}' \ +-device '{"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.0","addr":"0x2"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/iommufd.x86_64-latest.xml b/tests/qemuxmlconfdata/iommufd.x86_64-latest.xml new file mode 100644 index 0000000000..2e8951aaf6 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd.x86_64-latest.xml @@ -0,0 +1,38 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>3c7c30b5-7866-4b05-8a29-efebccba52a0</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>qemu64</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pci-root'/> + <controller type='usb' index='0' model='piix3-uhci'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> + </controller> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <audio id='1' type='none'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <memballoon model='virtio'> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </memballoon> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommufd.xml b/tests/qemuxmlconfdata/iommufd.xml new file mode 100644 index 0000000000..eb278414d2 --- /dev/null +++ b/tests/qemuxmlconfdata/iommufd.xml @@ -0,0 +1,30 @@ +<domain type='qemu'> + <name>foo</name> + <uuid>3c7c30b5-7866-4b05-8a29-efebccba52a0</uuid> + <memory unit='KiB'>2097152</memory> + <currentMemory unit='KiB'>2097152</currentMemory> + <vcpu placement='static' cpuset='0-1'>2</vcpu> + <os> + <type arch='x86_64' machine='pc'>hvm</type> + <boot dev='hd'/> + </os> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-x86_64</emulator> + <controller type='pci' index='0' model='pci-root'/> + <hostdev mode='subsystem' type='pci' managed='yes'> + <driver iommufd='yes'/> + <source> + <address domain='0x0000' bus='0x06' slot='0x12' function='0x5'/> + </source> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </hostdev> + <controller type='usb' index='0'/> + <input type='mouse' bus='ps2'/> + <input type='keyboard' bus='ps2'/> + <memballoon model='virtio'/> + </devices> +</domain> diff --git a/tests/qemuxmlconftest.c b/tests/qemuxmlconftest.c index 36a752ad2b..a6db64b759 100644 --- a/tests/qemuxmlconftest.c +++ b/tests/qemuxmlconftest.c @@ -351,6 +351,33 @@ fakeNetworkPortGetXMLDesc(virNetworkPortPtr port, } +static void +testSetupHostdevPrivateData(virDomainDef *def) +{ + size_t i; + + for (i = 0; i < def->nhostdevs; i++) { + virDomainHostdevDef *hostdev = def->hostdevs[i]; + + if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + hostdev->source.subsys.u.pci.driver.name == VIR_DEVICE_HOSTDEV_PCI_DRIVER_NAME_VFIO && + hostdev->source.subsys.u.pci.driver.iommufd == VIR_TRISTATE_BOOL_YES) { + + qemuDomainHostdevPrivate *priv; + + if (!hostdev->privateData) { + hostdev->privateData = qemuDomainHostdevPrivateNew(); + } + + priv = QEMU_DOMAIN_HOSTDEV_PRIVATE(hostdev); + /* Use a placeholder FD value for tests */ + priv->vfioDeviceFd = 0; + } + } +} + + static virNetworkDriver fakeNetworkDriver = { .networkLookupByName = fakeNetworkLookupByName, .networkGetXMLDesc = fakeNetworkGetXMLDesc, @@ -404,6 +431,8 @@ testCompareXMLToArgvCreateArgs(virQEMUDriver *drv, if (testQemuPrepareHostBackendChardevOne(NULL, priv->monConfig, vm) < 0) return NULL; + testSetupHostdevPrivateData(vm->def); + for (i = 0; i < vm->def->ndisks; i++) { virDomainDiskDef *disk = vm->def->disks[i]; virStorageSource *src; @@ -3047,6 +3076,10 @@ mymain(void) DO_TEST_CAPS_LATEST_PARSE_ERROR("virtio-iommu-dma-translation"); DO_TEST_CAPS_LATEST("acpi-generic-initiator"); + DO_TEST_CAPS_LATEST("iommufd"); + DO_TEST_CAPS_LATEST("iommufd-q35"); + DO_TEST_CAPS_ARCH_LATEST("iommufd-virt", "aarch64"); + DO_TEST_CAPS_LATEST("cpu-hotplug-startup"); DO_TEST_CAPS_ARCH_LATEST_PARSE_ERROR("cpu-hotplug-granularity", "ppc64"); -- 2.43.0
participants (2)
-
Ján Tomko -
Nathan Chen