[RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs

Hi, This is a draft solution for supporting multiple vSMMU instances in a qemu VM. Based on discussions/suggestions received for a previous RFC by Nicolin here[0], the association of vSMMUs to VFIO devices in VM PCIe topology should be moved out of qemu into libvirt. In addition, the nested SMMU nodes should be passed to qemu as pluggable devices. To address these changes, this patch series introduces a new "nestedSmmuv3" IOMMU model and "nestedSmmuv3" device type. Upon specifying the nestedSmmuv3 IOMMU model, nestedSmmuv3 devices will be auto-added to the VM definition based on the available SMMU nodes in the host's sysfs. The nestedSmmuv3 devices will each be attached to a separate PXB controller, and VFIO devices will be routed to PXBs based on their association with host SMMU nodes. This will maintain a VM PCIe topology that allows for multiple nested SMMUs per Nicolin's original qemu patch series in [0] and Shameer's work in [1] to remove VM topology changes from qemu and allow the nested SMMUs to be specified as pluggable devices. For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev for passthrough: <devices> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> </hostdev> <iommu model='nestedSmmuv3'/> </devices> Libvirt will scan sysfs and populate the VM definition with controllers and nestedSmmuv3 devices based on host config. So if /sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU node represented by /sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000 and there are 3 host SMMU nodes under /sys/class/iommu/, we'll see three auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus controller. Then the hostdev will be routed to a PXB controller that has a matching host SMMU node associated with it: <devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='251'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> <controller type='pci' index='3' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='249'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </controller> <controller type='pci' index='4' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='7' port='0x8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </controller> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/> </hostdev> <iommu model='nestedSmmuv3'/> <nestedSmmuv3> <name>smmu3.0x0000000012000000</name> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </nestedSmmuv3> <nestedSmmuv3> <name>smmu3.0x0000000016000000</name> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </nestedSmmuv3> <nestedSmmuv3> <name>smmu3.0x0000000011000000</name> <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/> </nestedSmmuv3> <iommu model='nestedSmmuv3'/> </devices> TODO: - No DMA mapping can found by UEFI when specifying multiple passthrough devices in the VM definition, and VM boot is subsequently blocked. We need to investigate this for the next revision, but we don't encounter this issue when passing through a single device. We'll include iommufd support in the next revision to narrow down whether the required fix would be outside of libvirt. - Shameer's qemu branch specifies nestedSmmuv3 bus number with "pci-bus" instead of "bus", so the libvirt compilation test args and qemu args in qemuBuildPCINestedSmmuv3DevProps() need to be modified to match this revision of qemu. It will be reverted to using "bus" in the next qemu revision. - This patchset decrements PXB busNr based on how many devices are attached downstream, and the libvirt documentation states we must reserve busNr for the PXB itself in addition to any devices attached downstream. When I launch a VM and a PXB has a pcie-root-port and hostdev attached downstream, busNrs 253, 252, and 251 are reserved. But the PXB itself already has a bus number assigned via the <address/> attribute, and I see 253 and 252 assigned to the hostdev and pcie-root-port in the VM but not 251. Should we decrement busNr based on libvirt documentation or do we only need two busNrs 253 and 252 in the example here? This series is on Github: https://github.com/NathanChenNVIDIA/libvirt/tree/nested-smmuv3-12-05-24 Thanks, Nathan [0] https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@nvidia.com/ [1] https://lore.kernel.org/qemu-devel/20241108125242.60136-1-shameerali.kolothu... Signed-off-by: Nathan Chen <nathanc@nvidia.com> Nathan Chen (5): conf: Add a nestedSmmuv3 IOMMU model qemu: Implement and auto-add a nestedSmmuv3 device type qemu: Create PXBs and auto-assign VFIO devs and nested SMMUs qemu: Update PXB busNr for nestedSmmuv3 controllers qemu: Add test case for specifying multiple nested SMMUs docs/formatdomain.rst | 25 ++- src/ch/ch_domain.c | 1 + src/conf/domain_addr.c | 26 ++- src/conf/domain_addr.h | 4 +- src/conf/domain_conf.c | 188 +++++++++++++++++ src/conf/domain_conf.h | 15 ++ src/conf/domain_postparse.c | 1 + src/conf/domain_validate.c | 24 +++ src/conf/schemas/domaincommon.rng | 17 ++ src/conf/virconftypes.h | 2 + src/libvirt_private.syms | 2 + src/lxc/lxc_driver.c | 6 + src/qemu/qemu_command.c | 64 +++++- src/qemu/qemu_command.h | 4 + src/qemu/qemu_domain.c | 2 + src/qemu/qemu_domain_address.c | 193 ++++++++++++++++++ src/qemu/qemu_driver.c | 3 + src/qemu/qemu_hotplug.c | 5 + src/qemu/qemu_postparse.c | 1 + src/qemu/qemu_validate.c | 16 ++ src/test/test_driver.c | 4 + tests/meson.build | 1 + .../iommu-nestedsmmuv3.aarch64-latest.args | 38 ++++ .../iommu-nestedsmmuv3.aarch64-latest.xml | 61 ++++++ tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml | 29 +++ tests/qemuxmlconftest.c | 4 +- tests/schemas/device.rng.in | 1 + tests/virnestedsmmuv3mock.c | 57 ++++++ 28 files changed, 788 insertions(+), 6 deletions(-) create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml create mode 100644 tests/virnestedsmmuv3mock.c -- 2.34.1

Add support for specifying "nestedSmmuv3" as the IOMMU model. In the following commits, when the "nestedSmmuv3" IOMMU model is parsed from the VM definition, PXB controllers and "nestedSmmuv3" devices will be auto-added to the VM definition and VFIO devices will be routed to PXB controllers based on their association with host SMMU nodes. The auto-added "nestedSmmuv3" devices will each be attached to a PXB controller so that VFIO devices with unique host SMMU node associations are attached to unique PXB controllers. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 5 +++-- src/conf/domain_conf.c | 1 + src/conf/domain_conf.h | 1 + src/conf/domain_validate.c | 2 ++ src/conf/schemas/domaincommon.rng | 1 + src/qemu/qemu_command.c | 6 +++++- src/qemu/qemu_domain_address.c | 2 ++ src/qemu/qemu_validate.c | 15 +++++++++++++++ 8 files changed, 30 insertions(+), 3 deletions(-) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 60bee8bd4f..63bb565991 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -8729,8 +8729,9 @@ Example: ... ``model`` - Supported values are ``intel`` (for Q35 guests) ``smmuv3`` - (:since:`since 5.5.0`, for ARM virt guests), and ``virtio`` + Supported values are ``intel`` (for Q35 guests), ``smmuv3`` + (:since:`since 5.5.0`, for ARM virt guests), ``nestedSmmuv3`` + (for ARM virt guests), and ``virtio`` (:since:`since 8.3.0`, for Q35 and ARM virt guests). ``driver`` diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 4ad8289b89..c1092551e0 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1349,6 +1349,7 @@ VIR_ENUM_IMPL(virDomainIOMMUModel, "intel", "smmuv3", "virtio", + "nestedSmmuv3", ); VIR_ENUM_IMPL(virDomainVsockModel, diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index a187ab4083..f8ab1b7d2f 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -2944,6 +2944,7 @@ typedef enum { VIR_DOMAIN_IOMMU_MODEL_INTEL, VIR_DOMAIN_IOMMU_MODEL_SMMUV3, VIR_DOMAIN_IOMMU_MODEL_VIRTIO, + VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3, VIR_DOMAIN_IOMMU_MODEL_LAST } virDomainIOMMUModel; diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index 1034bb57f5..c8b7b701bf 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -3001,6 +3001,7 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) { switch (iommu->model) { case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3: case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: if (iommu->intremap != VIR_TRISTATE_SWITCH_ABSENT || iommu->caching_mode != VIR_TRISTATE_SWITCH_ABSENT || @@ -3022,6 +3023,7 @@ virDomainIOMMUDefValidate(const virDomainIOMMUDef *iommu) switch (iommu->model) { case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3: case VIR_DOMAIN_IOMMU_MODEL_INTEL: if (iommu->info.type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE) { virReportError(VIR_ERR_XML_ERROR, diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index b3fdbf7ffb..de6a1e5c7e 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6144,6 +6144,7 @@ <value>intel</value> <value>smmuv3</value> <value>virtio</value> + <value>nestedSmmuv3</value> </choice> </attribute> <interleave> diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index dcb9c4934e..6629addace 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -6063,6 +6063,9 @@ qemuBuildIOMMUCommandLine(virCommand *cmd, /* There is no -device for SMMUv3, so nothing to be done here */ return 0; + case VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3: + return 0; + case VIR_DOMAIN_IOMMU_MODEL_LAST: default: virReportEnumRangeError(virDomainIOMMUModel, iommu->model); @@ -6885,7 +6888,8 @@ qemuBuildMachineCommandLine(virCommand *cmd, case VIR_DOMAIN_IOMMU_MODEL_VIRTIO: /* These IOMMUs are formatted in qemuBuildIOMMUCommandLine */ break; - + case VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3: + break; case VIR_DOMAIN_IOMMU_MODEL_LAST: default: virReportEnumRangeError(virDomainIOMMUModel, def->iommu->model); diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 970ae3949d..51f4bbd6eb 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -943,6 +943,7 @@ qemuDomainDeviceCalculatePCIConnectFlags(virDomainDeviceDef *dev, case VIR_DOMAIN_IOMMU_MODEL_INTEL: case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3: case VIR_DOMAIN_IOMMU_MODEL_LAST: /* These are not PCI devices */ return 0; @@ -2367,6 +2368,7 @@ qemuDomainAssignDevicePCISlots(virDomainDef *def, case VIR_DOMAIN_IOMMU_MODEL_INTEL: case VIR_DOMAIN_IOMMU_MODEL_SMMUV3: + case VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3: case VIR_DOMAIN_IOMMU_MODEL_LAST: /* These are not PCI devices */ break; diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c index aaa056379e..9f07ffe4a3 100644 --- a/src/qemu/qemu_validate.c +++ b/src/qemu/qemu_validate.c @@ -5104,6 +5104,21 @@ qemuValidateDomainDeviceDefIOMMU(const virDomainIOMMUDef *iommu, } break; + case VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3: + if (!qemuDomainIsARMVirt(def)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("IOMMU device: '%1$s' is only supported with ARM Virt machines"), + virDomainIOMMUModelTypeToString(iommu->model)); + return -1; + } + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_MACHINE_VIRT_IOMMU)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("IOMMU device: '%1$s' is not supported with this QEMU binary"), + virDomainIOMMUModelTypeToString(iommu->model)); + return -1; + } + break; + case VIR_DOMAIN_IOMMU_MODEL_LAST: default: virReportEnumRangeError(virDomainIOMMUModel, iommu->model); -- 2.34.1

Add support for parsing "nestedSmmuv3" devices from the VM definition. When a "nestedSmmuv3" IOMMU model is parsed from the VM definition, add a "nestedSmmuv3" device to the VM definition for each host SMMU node detected from sysfs. Specify the associated host SMMU sysfs filename as the "name" attribute for each "nestedSmmuv3" device in the VM definition. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- docs/formatdomain.rst | 20 ++++ src/ch/ch_domain.c | 1 + src/conf/domain_addr.h | 1 + src/conf/domain_conf.c | 186 ++++++++++++++++++++++++++++++ src/conf/domain_conf.h | 12 ++ src/conf/domain_postparse.c | 1 + src/conf/domain_validate.c | 22 ++++ src/conf/schemas/domaincommon.rng | 16 +++ src/conf/virconftypes.h | 2 + src/lxc/lxc_driver.c | 6 + src/qemu/qemu_command.c | 58 ++++++++++ src/qemu/qemu_command.h | 4 + src/qemu/qemu_domain.c | 2 + src/qemu/qemu_domain_address.c | 5 + src/qemu/qemu_driver.c | 3 + src/qemu/qemu_hotplug.c | 5 + src/qemu/qemu_postparse.c | 1 + src/qemu/qemu_validate.c | 1 + src/test/test_driver.c | 4 + tests/schemas/device.rng.in | 1 + 20 files changed, 351 insertions(+) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 63bb565991..739b5b1b89 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -8800,6 +8800,26 @@ The optional ``driver`` element allows to specify virtio options, see ... +NestedSmmuv3 +~~~~~~~~~~~~ +A representation of a host SMMU node. This is used to keep track of which +controller to assign a VFIO device to in the VM if it is associated with a +host SMMU node. This supports having multiple vSMMU nodes in the VM by +attaching devices with different SMMU nodes to different pcie-expander-bus +controllers in the VM. The ``name`` attribute denotes the host SMMU node's +identifier parsed from the host's sysfs. + +:: + + ... + <devices> + <nestedSmmuv3> + <name>smmu3.0x0000000005000000</name> + </nestedSmmuv3> + </devices> + ... + + Crypto ~~~~~~ diff --git a/src/ch/ch_domain.c b/src/ch/ch_domain.c index bfccabed49..7e7ca1c6ac 100644 --- a/src/ch/ch_domain.c +++ b/src/ch/ch_domain.c @@ -180,6 +180,7 @@ chValidateDomainDeviceDef(const virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_IOMMU: case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_PSTORE: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, _("Cloud-Hypervisor doesn't support '%1$s' device"), diff --git a/src/conf/domain_addr.h b/src/conf/domain_addr.h index e72fb48847..9781685903 100644 --- a/src/conf/domain_addr.h +++ b/src/conf/domain_addr.h @@ -57,6 +57,7 @@ typedef enum { VIR_PCI_CONNECT_TYPE_PCIE_EXPANDER_BUS = 1 << 11, VIR_PCI_CONNECT_TYPE_PCI_BRIDGE = 1 << 12, VIR_PCI_CONNECT_TYPE_PCIE_TO_PCI_BRIDGE = 1 << 13, + VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3 = 1 << 14, } virDomainPCIConnectFlags; /* a combination of all bits that describe the type of connections diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index c1092551e0..24aff1cfbe 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -341,6 +341,7 @@ VIR_ENUM_IMPL(virDomainDevice, "audio", "crypto", "pstore", + "nestedSmmuv3", ); VIR_ENUM_IMPL(virDomainDiskDevice, @@ -3449,6 +3450,19 @@ virDomainHostdevDefNew(void) } +virDomainNestedSmmuv3Def * +virDomainNestedSmmuv3DefNew(void) +{ + virDomainNestedSmmuv3Def *def; + + def = g_new0(virDomainNestedSmmuv3Def, 1); + + def->info = g_new0(virDomainDeviceInfo, 1); + + return def; +} + + static virDomainTPMDef * virDomainTPMDefNew(virDomainXMLOption *xmlopt) { @@ -3509,6 +3523,18 @@ void virDomainHostdevDefFree(virDomainHostdevDef *def) g_free(def); } +void virDomainNestedSmmuv3DefFree(virDomainNestedSmmuv3Def *def) +{ + if (!def) + return; + + g_free(def->name); + + virDomainDeviceInfoFree(def->info); + + g_free(def); +} + void virDomainHubDefFree(virDomainHubDef *def) { if (!def) @@ -3671,6 +3697,9 @@ void virDomainDeviceDefFree(virDomainDeviceDef *def) case VIR_DOMAIN_DEVICE_PSTORE: virDomainPstoreDefFree(def->data.pstore); break; + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: + virDomainNestedSmmuv3DefFree(def->data.nestedsmmuv3); + break; case VIR_DOMAIN_DEVICE_LAST: case VIR_DOMAIN_DEVICE_NONE: break; @@ -4597,6 +4626,8 @@ virDomainDeviceGetInfo(const virDomainDeviceDef *device) return &device->data.crypto->info; case VIR_DOMAIN_DEVICE_PSTORE: return &device->data.pstore->info; + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: + return device->data.nestedsmmuv3->info; /* The following devices do not contain virDomainDeviceInfo */ case VIR_DOMAIN_DEVICE_LEASE: @@ -4705,6 +4736,9 @@ virDomainDeviceSetData(virDomainDeviceDef *device, case VIR_DOMAIN_DEVICE_PSTORE: device->data.pstore = devicedata; break; + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: + device->data.nestedsmmuv3 = devicedata; + break; case VIR_DOMAIN_DEVICE_NONE: case VIR_DOMAIN_DEVICE_LAST: break; @@ -4930,6 +4964,13 @@ virDomainDeviceInfoIterateFlags(virDomainDef *def, return rc; } + device.type = VIR_DOMAIN_DEVICE_NESTED_SMMUV3; + for (i = 0; i < def->nnestedsmmus; i++) { + device.data.nestedsmmuv3 = def->nestedsmmus[i]; + if ((rc = cb(def, &device, def->nestedsmmus[i]->info, opaque)) != 0) + return rc; + } + /* If the flag below is set, make sure @cb can handle @info being NULL */ if (iteratorFlags & DOMAIN_DEVICE_ITERATE_MISSING_INFO) { device.type = VIR_DOMAIN_DEVICE_GRAPHICS; @@ -4990,6 +5031,7 @@ virDomainDeviceInfoIterateFlags(virDomainDef *def, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: break; } #endif @@ -13365,6 +13407,40 @@ virDomainHostdevDefParseXML(virDomainXMLOption *xmlopt, } +static virDomainNestedSmmuv3Def * +virDomainNestedSmmuv3DefParseXML(virDomainXMLOption *xmlopt, + xmlNodePtr node, + xmlXPathContextPtr ctxt, + unsigned int flags) +{ + virDomainNestedSmmuv3Def *def; + size_t nameLength; + VIR_XPATH_NODE_AUTORESTORE(ctxt) + + ctxt->node = node; + + if (!(def = virDomainNestedSmmuv3DefNew())) + goto error; + + nameLength = strlen(virXPathString("string(./name)", ctxt)) + 1; + VIR_REALLOC_N(def->name, nameLength); + if (!def->name) + goto error; + virStrcpy(def->name, virXPathString("string(./name)", ctxt), nameLength); + + if (def->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE) { + if (virDomainDeviceInfoParseXML(xmlopt, node, ctxt, def->info, + flags) < 0) + goto error; + } + return def; + + error: + virDomainNestedSmmuv3DefFree(def); + return NULL; +} + + static virDomainRedirdevDef * virDomainRedirdevDefParseXML(virDomainXMLOption *xmlopt, xmlNodePtr node, @@ -14366,6 +14442,12 @@ virDomainDeviceDefParse(const char *xmlStr, return NULL; } break; + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: + if (!(dev->data.nestedsmmuv3 = virDomainNestedSmmuv3DefParseXML(xmlopt, node, + ctxt, flags))) { + return NULL; + } + break; case VIR_DOMAIN_DEVICE_NONE: case VIR_DOMAIN_DEVICE_LAST: break; @@ -19445,6 +19527,21 @@ virDomainDefParseXML(xmlXPathContextPtr ctxt, VIR_FREE(nodes); + /* analysis of the nested SMMUs */ + if ((n = virXPathNodeSet("./devices/nestedSmmuv3", ctxt, &nodes)) < 0) + return NULL; + if (n > 0) + VIR_REALLOC_N(def->nestedsmmus, def->nnestedsmmus + n); + for (i = 0; i < n; i++) { + virDomainNestedSmmuv3Def *nestedsmmuv3; + nestedsmmuv3 = virDomainNestedSmmuv3DefParseXML(xmlopt, nodes[i], ctxt, + flags); + if (!nestedsmmuv3) + return NULL; + def->nestedsmmus[def->nnestedsmmus++] = nestedsmmuv3; + } + VIR_FREE(nodes); + /* analysis of the host devices */ if ((n = virXPathNodeSet("./devices/hostdev", ctxt, &nodes)) < 0) return NULL; @@ -20664,6 +20761,20 @@ virDomainHostdevDefCheckABIStability(virDomainHostdevDef *src, } +static bool +virDomainNestedSmmuv3DefCheckABIStability(virDomainNestedSmmuv3Def *src, + virDomainNestedSmmuv3Def *dst) +{ + if (STRNEQ(src->name, dst->name)) + return false; + + if (!virDomainDeviceInfoCheckABIStability(src->info, dst->info)) + return false; + + return true; +} + + static bool virDomainSmartcardDefCheckABIStability(virDomainSmartcardDef *src, virDomainSmartcardDef *dst) @@ -22001,6 +22112,18 @@ virDomainDefCheckABIStabilityFlags(virDomainDef *src, goto error; } + if (src->nnestedsmmus != dst->nnestedsmmus) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Target domain nested SMMUs count %1$zu does not match source %2$zu"), + dst->nnestedsmmus, src->nnestedsmmus); + goto error; + } + + for (i = 0; i < src->nnestedsmmus; i++) + if (!virDomainNestedSmmuv3DefCheckABIStability(src->nestedsmmus[i], + dst->nestedsmmus[i])) + goto error; + if ((!src->redirfilter && dst->redirfilter) || (src->redirfilter && !dst->redirfilter)) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, @@ -22171,6 +22294,7 @@ virDomainDefCheckABIStabilityFlags(virDomainDef *src, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: break; } #endif @@ -22366,6 +22490,36 @@ virDomainDefAddImplicitVideo(virDomainDef *def, virDomainXMLOption *xmlopt) return 0; } +static int +virDomainDefAddImplicitNestedSmmuv3(virDomainDef *def) +{ + // Get the number of host-level SMMUv3 instances + g_autoptr(DIR) dir = NULL; + struct dirent *dent; + int num = 0; + + virDomainNestedSmmuv3Def* nestedsmmuv3 = NULL; + + if (virDirOpen(&dir, "/sys/class/iommu") < 0) + return -1; + + while (virDirRead(dir, &dent, "/sys/class/iommu") > 0) { + if (!(nestedsmmuv3 = virDomainNestedSmmuv3DefNew())) + return -1; + if (STRPREFIX(dent->d_name, "smmu3.0x")) { + VIR_REALLOC_N(nestedsmmuv3->name, strlen(dent->d_name) + 1); + virStrcpy(nestedsmmuv3->name, dent->d_name, strlen(dent->d_name) + 1); + VIR_REALLOC_N(def->nestedsmmus, def->nnestedsmmus + 1); + def->nestedsmmus[def->nnestedsmmus++] = nestedsmmuv3; + num++; + } + } + if (num == 0) + return -1; + + return 0; +} + int virDomainDefAddImplicitDevices(virDomainDef *def, virDomainXMLOption *xmlopt) { @@ -22379,6 +22533,13 @@ virDomainDefAddImplicitDevices(virDomainDef *def, virDomainXMLOption *xmlopt) if (virDomainDefAddImplicitVideo(def, xmlopt) < 0) return -1; + if (def->iommu != NULL && + def->iommu->model == VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3 && + def->nnestedsmmus < 1) { + if (virDomainDefAddImplicitNestedSmmuv3(def) < 0) + return -1; + } + return 0; } @@ -26814,6 +26975,24 @@ virDomainHostdevDefFormat(virBuffer *buf, return 0; } +static int +virDomainNestedSmmuv3DefFormat(virBuffer *buf, + virDomainNestedSmmuv3Def *def, + unsigned int flags) +{ + virBufferAddLit(buf, "<nestedSmmuv3>\n"); + virBufferAdjustIndent(buf, 2); + + virBufferAsprintf(buf, "<name>%s</name>\n", def->name); + + virDomainDeviceInfoFormat(buf, def->info, flags); + + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</nestedSmmuv3>\n"); + + return 0; +} + static int virDomainRedirdevDefFormat(virBuffer *buf, virDomainRedirdevDef *def, @@ -28672,6 +28851,12 @@ virDomainDefFormatInternalSetRootName(virDomainDef *def, for (n = 0; n < def->ncryptos; n++) { virDomainCryptoDefFormat(buf, def->cryptos[n], flags); } + + for (n = 0; n < def->nnestedsmmus; n++) { + if (virDomainNestedSmmuv3DefFormat(buf, def->nestedsmmus[n], flags) < 0) + return -1; + } + if (def->iommu) virDomainIOMMUDefFormat(buf, def->iommu); @@ -28841,6 +29026,7 @@ virDomainDeviceIsUSB(virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: break; } diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index f8ab1b7d2f..edde9b63d6 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -88,6 +88,7 @@ typedef enum { VIR_DOMAIN_DEVICE_AUDIO, VIR_DOMAIN_DEVICE_CRYPTO, VIR_DOMAIN_DEVICE_PSTORE, + VIR_DOMAIN_DEVICE_NESTED_SMMUV3, VIR_DOMAIN_DEVICE_LAST } virDomainDeviceType; @@ -122,6 +123,7 @@ struct _virDomainDeviceDef { virDomainAudioDef *audio; virDomainCryptoDef *crypto; virDomainPstoreDef *pstore; + virDomainNestedSmmuv3Def *nestedsmmuv3; } data; }; @@ -353,6 +355,11 @@ typedef enum { VIR_DOMAIN_STARTUP_POLICY_LAST } virDomainStartupPolicy; +struct _virDomainNestedSmmuv3Def { + char *name; + virDomainDeviceInfo *info; /* Guest address */ +}; + /* basic device for direct passthrough */ struct _virDomainHostdevDef { /* If 'parentnet' is non-NULL it means this host dev was @@ -3197,6 +3204,9 @@ struct _virDomainDef { size_t ntpms; virDomainTPMDef **tpms; + size_t nnestedsmmus; + virDomainNestedSmmuv3Def **nestedsmmus; + /* Only 1 */ virDomainMemballoonDef *memballoon; virDomainNVRAMDef *nvram; @@ -3677,6 +3687,8 @@ virDomainVideoDef *virDomainVideoDefNew(virDomainXMLOption *xmlopt); void virDomainVideoDefFree(virDomainVideoDef *def); G_DEFINE_AUTOPTR_CLEANUP_FUNC(virDomainVideoDef, virDomainVideoDefFree); void virDomainVideoDefClear(virDomainVideoDef *def); +virDomainNestedSmmuv3Def *virDomainNestedSmmuv3DefNew(void); +void virDomainNestedSmmuv3DefFree(virDomainNestedSmmuv3Def *def); virDomainHostdevDef *virDomainHostdevDefNew(void); void virDomainHostdevDefFree(virDomainHostdevDef *def); void virDomainHubDefFree(virDomainHubDef *def); diff --git a/src/conf/domain_postparse.c b/src/conf/domain_postparse.c index bf33f29638..b838a9688e 100644 --- a/src/conf/domain_postparse.c +++ b/src/conf/domain_postparse.c @@ -757,6 +757,7 @@ virDomainDeviceDefPostParseCommon(virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_IOMMU: case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_PSTORE: ret = 0; break; diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c index c8b7b701bf..20d16c8b84 100644 --- a/src/conf/domain_validate.c +++ b/src/conf/domain_validate.c @@ -2310,6 +2310,25 @@ virDomainHostdevDefValidate(const virDomainHostdevDef *hostdev) } +static int +virDomainNestedSmmuv3DefValidate(const virDomainNestedSmmuv3Def *nestedsmmuv3) +{ + if (nestedsmmuv3->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_NONE && + nestedsmmuv3->info->type != VIR_DOMAIN_DEVICE_ADDRESS_TYPE_PCI) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("nestedsmmuv3 must use 'pci' address type")); + return -1; + } + + if (nestedsmmuv3->name[0] == '\0') { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("nestedsmmuv3 must have an associated sysfs node name")); + return -1; + } + return 0; +} + + /** * virDomainMemoryGetMappedSize: * @mem: memory device definition @@ -3230,6 +3249,9 @@ virDomainDeviceDefValidateInternal(const virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_PSTORE: return virDomainPstoreDefValidate(dev->data.pstore); + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: + return virDomainNestedSmmuv3DefValidate(dev->data.nestedsmmuv3); + case VIR_DOMAIN_DEVICE_LEASE: case VIR_DOMAIN_DEVICE_WATCHDOG: case VIR_DOMAIN_DEVICE_HUB: diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng index de6a1e5c7e..e39559952e 100644 --- a/src/conf/schemas/domaincommon.rng +++ b/src/conf/schemas/domaincommon.rng @@ -6359,6 +6359,21 @@ </element> </define> + <define name="nestedSmmuv3"> + <element name="nestedSmmuv3"> + <interleave> + <optional> + <element name="name"> + <text/> + </element> + </optional> + <optional> + <ref name="address"/> + </optional> + </interleave> + </element> + </define> + <define name="hostdev"> <element name="hostdev"> <interleave> @@ -6808,6 +6823,7 @@ <ref name="shmem"/> <ref name="memorydev"/> <ref name="crypto"/> + <ref name="nestedSmmuv3"/> </choice> </zeroOrMore> <zeroOrMore> diff --git a/src/conf/virconftypes.h b/src/conf/virconftypes.h index 59be61cea4..b36020d9a5 100644 --- a/src/conf/virconftypes.h +++ b/src/conf/virconftypes.h @@ -118,6 +118,8 @@ typedef struct _virDomainHostdevCaps virDomainHostdevCaps; typedef struct _virDomainHostdevDef virDomainHostdevDef; +typedef struct _virDomainNestedSmmuv3Def virDomainNestedSmmuv3Def; + typedef struct _virDomainHostdevSubsys virDomainHostdevSubsys; typedef struct _virDomainHostdevSubsysMediatedDev virDomainHostdevSubsysMediatedDev; diff --git a/src/lxc/lxc_driver.c b/src/lxc/lxc_driver.c index 2488940feb..5a7fb9b62d 100644 --- a/src/lxc/lxc_driver.c +++ b/src/lxc/lxc_driver.c @@ -3021,6 +3021,7 @@ lxcDomainAttachDeviceConfig(virDomainDef *vmdef, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", _("persistent attach of device is not supported")); break; @@ -3087,6 +3088,7 @@ lxcDomainUpdateDeviceConfig(virDomainDef *vmdef, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", _("persistent update of device is not supported")); break; @@ -3169,6 +3171,7 @@ lxcDomainDetachDeviceConfig(virDomainDef *vmdef, case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", _("persistent detach of device is not supported")); break; @@ -3271,6 +3274,7 @@ lxcDomainAttachDeviceMknodHelper(pid_t pid G_GNUC_UNUSED, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: virReportError(VIR_ERR_INTERNAL_ERROR, _("Unexpected device type %1$d"), data->def->type); @@ -3947,6 +3951,7 @@ lxcDomainAttachDeviceLive(virLXCDriver *driver, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, _("device type '%1$s' cannot be attached"), virDomainDeviceTypeToString(dev->type)); @@ -4365,6 +4370,7 @@ lxcDomainDetachDeviceLive(virLXCDriver *driver, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, _("device type '%1$s' cannot be detached"), virDomainDeviceTypeToString(dev->type)); diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 6629addace..3186e60a0e 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -975,6 +975,7 @@ qemuBuildVirtioDevGetConfigDev(const virDomainDeviceDef *device, case VIR_DOMAIN_DEVICE_IOMMU: case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: default: break; @@ -4564,6 +4565,33 @@ qemuBuildVideoCommandLine(virCommand *cmd, } +virJSONValue * +qemuBuildPCINestedSmmuv3DevProps(const virDomainDef *def, + virDomainNestedSmmuv3Def *nestedsmmuv3) +{ + g_autoptr(virJSONValue) props = NULL; + g_autofree char *pciaddr = NULL; + g_autofree char *bus = qemuBuildDeviceAddressPCIGetBus(def, nestedsmmuv3->info); + + if (!bus) + return NULL; + + if (nestedsmmuv3->info->addr.pci.function != 0) + pciaddr = g_strdup_printf("0x%x.0x%x", nestedsmmuv3->info->addr.pci.slot, + nestedsmmuv3->info->addr.pci.function); + else + pciaddr = g_strdup_printf("0x%x", nestedsmmuv3->info->addr.pci.slot); + + if (virJSONValueObjectAdd(&props, + "s:driver", "arm-smmuv3-nested", + "s:bus", bus, + NULL) < 0) + return NULL; + + return g_steal_pointer(&props); +} + + virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, virDomainHostdevDef *dev) @@ -4988,6 +5016,33 @@ qemuBuildHostdevSCSICommandLine(virCommand *cmd, } +static int +qemuBuildNestedSmmuv3CommandLine(virCommand *cmd, + const virDomainDef *def, + virQEMUCaps *qemuCaps) +{ + size_t i; + + for (i = 0; i < def->nnestedsmmus; i++) { + g_autoptr(virJSONValue) devprops = NULL; + virDomainNestedSmmuv3Def *nestedsmmuv3 = def->nestedsmmus[i]; + + if (nestedsmmuv3->info->type == VIR_DOMAIN_DEVICE_ADDRESS_TYPE_UNASSIGNED) + continue; + + if (qemuCommandAddExtDevice(cmd, nestedsmmuv3->info, def, qemuCaps) < 0) + return -1; + + if (!(devprops = qemuBuildPCINestedSmmuv3DevProps(def, nestedsmmuv3))) + return -1; + + if (qemuBuildDeviceCommandlineFromJSON(cmd, devprops, def, qemuCaps) < 0) + return -1; + } + return 0; +} + + static int qemuBuildHostdevCommandLine(virCommand *cmd, const virDomainDef *def, @@ -10558,6 +10613,9 @@ qemuBuildCommandLine(virDomainObj *vm, if (qemuBuildHostdevCommandLine(cmd, def, qemuCaps) < 0) return NULL; + if (qemuBuildNestedSmmuv3CommandLine(cmd, def, qemuCaps) < 0) + return NULL; + if (migrateURI) virCommandAddArgList(cmd, "-incoming", migrateURI, NULL); diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h index 76c514b5f7..f09dcc514c 100644 --- a/src/qemu/qemu_command.h +++ b/src/qemu/qemu_command.h @@ -157,6 +157,10 @@ qemuBuildThreadContextProps(virJSONValue **tcProps, qemuDomainObjPrivate *priv, virBitmap *nodemask); +virJSONValue * +qemuBuildPCINestedSmmuv3DevProps(const virDomainDef *def, + virDomainNestedSmmuv3Def *nestedsmmuv3); + /* Current, best practice */ virJSONValue * qemuBuildPCIHostdevDevProps(const virDomainDef *def, diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 3366346624..2567d620d9 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -8718,6 +8718,7 @@ qemuDomainPrepareChardevSourceOne(virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_VSOCK: case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_PSTORE: break; } @@ -10670,6 +10671,7 @@ qemuDomainDeviceBackendChardevForeachOne(virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_VSOCK: case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_PSTORE: /* no chardev backend */ break; diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 51f4bbd6eb..31004bfc7e 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -470,6 +470,7 @@ qemuDomainDeviceSupportZPCI(virDomainDeviceDef *device) case VIR_DOMAIN_DEVICE_VSOCK: case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_PSTORE: break; @@ -819,6 +820,10 @@ qemuDomainDeviceCalculatePCIConnectFlags(virDomainDeviceDef *dev, return pciFlags; } + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: { + return VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3; + } + case VIR_DOMAIN_DEVICE_MEMBALLOON: switch (dev->data.memballoon->model) { case VIR_DOMAIN_MEMBALLOON_MODEL_VIRTIO_TRANSITIONAL: diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 09f7edda7d..d9ade7000b 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -6880,6 +6880,7 @@ qemuDomainAttachDeviceConfig(virDomainDef *vmdef, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("persistent attach of device '%1$s' is not supported"), @@ -7099,6 +7100,7 @@ qemuDomainDetachDeviceConfig(virDomainDef *vmdef, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("persistent detach of device '%1$s' is not supported"), @@ -7225,6 +7227,7 @@ qemuDomainUpdateDeviceConfig(virDomainDef *vmdef, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("persistent update of device '%1$s' is not supported"), diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 3c18af6b0c..0a67df8605 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -3521,6 +3521,7 @@ qemuDomainAttachDeviceLive(virDomainObj *vm, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("live attach of device '%1$s' is not supported"), @@ -5420,6 +5421,7 @@ qemuDomainRemoveAuditDevice(virDomainObj *vm, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: /* libvirt doesn't yet support detaching these devices */ break; @@ -5525,6 +5527,7 @@ qemuDomainRemoveDevice(virQEMUDriver *driver, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("don't know how to remove a %1$s device"), @@ -6420,6 +6423,7 @@ qemuDomainDetachDeviceLive(virDomainObj *vm, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("live detach of device '%1$s' is not supported"), @@ -7412,6 +7416,7 @@ qemuDomainUpdateDeviceLive(virDomainObj *vm, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_CONFIG_UNSUPPORTED, _("live update of device '%1$s' is not supported"), diff --git a/src/qemu/qemu_postparse.c b/src/qemu/qemu_postparse.c index 03b5ef825a..4e8892b838 100644 --- a/src/qemu/qemu_postparse.c +++ b/src/qemu/qemu_postparse.c @@ -897,6 +897,7 @@ qemuDomainDeviceDefPostParse(virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_NVRAM: case VIR_DOMAIN_DEVICE_RNG: case VIR_DOMAIN_DEVICE_AUDIO: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_CRYPTO: ret = 0; break; diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c index 9f07ffe4a3..d84f04903f 100644 --- a/src/qemu/qemu_validate.c +++ b/src/qemu/qemu_validate.c @@ -5470,6 +5470,7 @@ qemuValidateDomainDeviceDef(const virDomainDeviceDef *dev, case VIR_DOMAIN_DEVICE_PSTORE: return qemuValidateDomainDeviceDefPstore(dev->data.pstore, def, qemuCaps); + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LEASE: case VIR_DOMAIN_DEVICE_PANIC: case VIR_DOMAIN_DEVICE_NONE: diff --git a/src/test/test_driver.c b/src/test/test_driver.c index f1cefb5c50..c63e76baa5 100644 --- a/src/test/test_driver.c +++ b/src/test/test_driver.c @@ -10451,6 +10451,7 @@ testDomainAttachDeviceLive(virDomainObj *vm, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("live attach of device '%1$s' is not supported"), @@ -10594,6 +10595,7 @@ testDomainUpdateDevice(virDomainDef *vmdef, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("persistent update of device '%1$s' is not supported"), @@ -10965,6 +10967,7 @@ testDomainRemoveDevice(testDriver *driver, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("live detach of device '%1$s' is not supported"), @@ -11036,6 +11039,7 @@ testDomainDetachDeviceLive(testDriver *driver, case VIR_DOMAIN_DEVICE_AUDIO: case VIR_DOMAIN_DEVICE_CRYPTO: case VIR_DOMAIN_DEVICE_PSTORE: + case VIR_DOMAIN_DEVICE_NESTED_SMMUV3: case VIR_DOMAIN_DEVICE_LAST: virReportError(VIR_ERR_OPERATION_UNSUPPORTED, _("live detach of device '%1$s' is not supported"), diff --git a/tests/schemas/device.rng.in b/tests/schemas/device.rng.in index b322b5275e..fc59700574 100644 --- a/tests/schemas/device.rng.in +++ b/tests/schemas/device.rng.in @@ -45,6 +45,7 @@ <ref name="panic"/> <ref name="iommu"/> <ref name="vsock"/> + <ref name="nestedSmmuv3"/> </choice> </start> -- 2.34.1

Add a pcie-expander-bus controller to the VM definition for each "nestedSmmuv3" device that is generated when the "nestedSmmuv3" IOMMU model is parsed from the VM definition. Assign each "nestedSmmuv3" device to one PXB controller, and route any unmanaged "hostdev" VFIO devices with associated host SMMU nodes to their corresponding PXB controller based on the "name" attributes of "nestedSmmuv3" devices attached to these PXB controllers. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_addr.c | 26 ++++++- src/conf/domain_addr.h | 3 +- src/conf/domain_conf.c | 1 + src/qemu/qemu_domain_address.c | 134 +++++++++++++++++++++++++++++++++ 4 files changed, 162 insertions(+), 2 deletions(-) diff --git a/src/conf/domain_addr.c b/src/conf/domain_addr.c index a53ff6df6c..6d8ca89025 100644 --- a/src/conf/domain_addr.c +++ b/src/conf/domain_addr.c @@ -386,6 +386,8 @@ virDomainPCIAddressFlagsCompatible(virPCIDeviceAddress *addr, connectStr = "pcie-expander-bus"; } else if (devFlags & VIR_PCI_CONNECT_TYPE_PCI_BRIDGE) { connectStr = "pci-bridge"; + } else if (devFlags & VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3) { + connectStr = "nestedSmmuv3 device"; } else { /* this should never happen. If it does, there is a * bug in the code that sets the flag bits for devices. @@ -565,7 +567,8 @@ virDomainPCIAddressBusSetModel(virDomainPCIAddressBus *bus, * dmi-to-pci-bridge */ bus->flags = (VIR_PCI_CONNECT_TYPE_PCIE_ROOT_PORT | - VIR_PCI_CONNECT_TYPE_DMI_TO_PCI_BRIDGE); + VIR_PCI_CONNECT_TYPE_DMI_TO_PCI_BRIDGE | + VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3); bus->minSlot = 0; bus->maxSlot = VIR_PCI_ADDRESS_SLOT_LAST; break; @@ -690,6 +693,8 @@ virDomainPCIAddressSetGrow(virDomainPCIAddressSet *addrs, } else if (flags & (VIR_PCI_CONNECT_TYPE_PCIE_DEVICE | VIR_PCI_CONNECT_TYPE_PCIE_SWITCH_UPSTREAM_PORT)) { model = VIR_DOMAIN_CONTROLLER_MODEL_PCIE_ROOT_PORT; + } else if (flags & VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3) { + model = VIR_DOMAIN_CONTROLLER_MODEL_PCIE_EXPANDER_BUS; } else { /* The types of devices that we can't auto-add a controller for: * @@ -1030,6 +1035,11 @@ virDomainPCIAddressFindUnusedFunctionOnBus(virDomainPCIAddressBus *bus, break; } + if (flags == VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3) { + *found = false; + break; + } + if (flags & VIR_PCI_CONNECT_AGGREGATE_SLOT && bus->slot[searchAddr->slot].aggregate) { /* slot and device are okay with aggregating devices */ @@ -1087,6 +1097,20 @@ virDomainPCIAddressGetNextAddr(virDomainPCIAddressSet *addrs, else a.function = function; + if (flags == VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3) { + if (addrs->dryRun) { + virDomainPCIAddressBus *bus = &addrs->buses[addrs->nbuses - 1]; + /* a is already set to the first new bus */ + a.bus = addrs->nbuses; + a.slot = bus->minSlot; + if (virDomainPCIAddressSetGrow(addrs, &a, flags) < 0) + return -1; + /* this device will use the first slot of the new bus */ + a.slot = addrs->buses[a.bus].minSlot; + goto success; + } + } + /* When looking for a suitable bus for the device, start by being * very strict and ignoring all those where the isolation groups * don't match. This ensures all devices sharing the same isolation diff --git a/src/conf/domain_addr.h b/src/conf/domain_addr.h index 9781685903..2881f2dadb 100644 --- a/src/conf/domain_addr.h +++ b/src/conf/domain_addr.h @@ -72,7 +72,8 @@ typedef enum { VIR_PCI_CONNECT_TYPE_PCI_EXPANDER_BUS | \ VIR_PCI_CONNECT_TYPE_PCIE_EXPANDER_BUS | \ VIR_PCI_CONNECT_TYPE_PCI_BRIDGE | \ - VIR_PCI_CONNECT_TYPE_PCIE_TO_PCI_BRIDGE) + VIR_PCI_CONNECT_TYPE_PCIE_TO_PCI_BRIDGE | \ + VIR_PCI_CONNECT_TYPE_NESTED_SMMUV3) /* combination of all bits that could be used to connect a normal * endpoint device (i.e. excluding the connection possible between an diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 24aff1cfbe..46f9b9b0cf 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -25,6 +25,7 @@ #include <fcntl.h> #include <sys/stat.h> #include <unistd.h> +#include <dirent.h> #include "configmake.h" #include "internal.h" diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 31004bfc7e..dee198a7d2 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -1627,6 +1627,19 @@ qemuDomainPCIAddressSetCreate(virDomainDef *def, addrs->dryRun = dryRun; + /* PXB indices must come before pcie-root-port indices in qemu, + * so add PXB buses to addrs before the pcie-root-ports. */ + + if (addrs->dryRun) { + for (i = 0; i < def->nnestedsmmus; i++) { + if (!virDeviceInfoPCIAddressIsWanted(def->nestedsmmus[i]->info)) + continue; + if (qemuDomainPCIAddressReserveNextAddr(addrs, + def->nestedsmmus[i]->info) < 0) + return NULL; + } + } + /* pSeries domains support multiple pci-root controllers */ if (qemuDomainIsPSeries(def)) addrs->areMultipleRootsSupported = true; @@ -2030,6 +2043,109 @@ qemuDomainValidateDevicePCISlotsChipsets(virDomainDef *def, } +static char* +retrieveSysfsDevPath(virPCIDeviceAddress* addr, const char* path) +{ + return g_strdup_printf("/sys/bus/pci/devices/%04x:%02x:%02x.%01x%s", + addr->domain, + addr->bus, + addr->slot, + addr->function, + path ? path : ""); +} + + +static char * +nestedSmmuVfioHostdevFound(virDomainHostdevDef *hostdev, bool dryRun) +{ + char* devPath = NULL; + char* devSmmuPath = NULL; + char* devVFIOPath = NULL; + g_autoptr(DIR) dir = NULL; + g_autoptr(DIR) smmuDir = NULL; + g_autoptr(DIR) VFIODir = NULL; + char* dir_iommu = NULL; + char* smmu_node = NULL; + devPath = retrieveSysfsDevPath(&hostdev->source.subsys.u.pci.addr, ""); + if (virDirOpenIfExists(&dir, devPath) < 1) + return NULL; + devSmmuPath = retrieveSysfsDevPath(&hostdev->source.subsys.u.pci.addr, "/iommu"); + if (virDirOpenIfExists(&smmuDir, devSmmuPath) < 1) + return NULL; + devVFIOPath = retrieveSysfsDevPath(&hostdev->source.subsys.u.pci.addr, "/vfio-dev"); + if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + (hostdev->info->addr.pci.bus != 0 || dryRun)) { + // We only want to route vfio hostdevs + if (hostdev->managed || + (virDirOpenIfExists(&VFIODir, devVFIOPath) == 1)) { + // Get the hostdev's associated SMMU node name + dir_iommu = realpath(devSmmuPath, NULL); + if (!dir_iommu) + return NULL; + smmu_node = g_path_get_basename(dir_iommu); + if (!smmu_node) + return NULL; + } + } + return smmu_node; +} + + +static virDomainControllerDef * +qemuDomainGetUpstreamCont(virDomainDef *def, + virDomainDeviceInfo *downstreamInfo, + int model) +{ + size_t i; + for (i = 0; i < def->ncontrollers; i++) { + if (def->controllers[i]->idx == downstreamInfo->addr.pci.bus && + def->controllers[i]->model == model) + return def->controllers[i]; + } + return NULL; +} + + +static int +qemuDomainAssignNestedSmmuv3HostdevSlots(virDomainDef *def, + virDomainPCIAddressSet *addrs) +{ + size_t i, j; + char* smmu_node = NULL; + virDomainControllerDef *rootPort; + virDomainPCIAddressSet *set = NULL; + if (def->iommu != NULL && def->iommu->model == VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3 && + def->nnestedsmmus > 0) { + for (i = 0; i < def->nhostdevs; i++) { + if (!(smmu_node = nestedSmmuVfioHostdevFound(def->hostdevs[i], addrs->dryRun))) + continue; + /* Find a hostdev and nested SMMU pair */ + for (j = 0; j < def->nnestedsmmus; j++) { + unsigned int nestedSmmuBus = def->nestedsmmus[j]->info->addr.pci.bus; + virDomainControllerDef *pxb; + if (!STREQLEN(def->nestedsmmus[j]->name, smmu_node, strlen(smmu_node))) + continue; + /* Get the hostdev's pcie-root-port controller */ + rootPort = qemuDomainGetUpstreamCont(def, def->hostdevs[i]->info, + VIR_DOMAIN_CONTROLLER_MODEL_PCIE_ROOT_PORT); + /* Skip if already assigned */ + pxb = qemuDomainGetUpstreamCont(def, &rootPort->info, + VIR_DOMAIN_CONTROLLER_MODEL_PCIE_EXPANDER_BUS); + if (pxb) + break; + /* Assign the controller to the next available slot/func on + * the corresponding PXB */ + set = virDomainPCIAddressSetAlloc(nestedSmmuBus + 1, + VIR_PCI_ADDRESS_EXTENSION_NONE); + set->buses[nestedSmmuBus] = addrs->buses[nestedSmmuBus]; + qemuDomainPCIAddressReserveNextAddr(set, &rootPort->info); + break; + } + } + } + return 0; +} + /* * This assigns static PCI slots to all configured devices. * The ordering here is chosen to match the ordering used @@ -2262,6 +2378,18 @@ qemuDomainAssignDevicePCISlots(virDomainDef *def, return -1; } + /* Nested SMMUs */ + if (!addrs->dryRun) { + for (i = 0; i < def->nnestedsmmus; i++) { + if (!virDeviceInfoPCIAddressIsWanted(def->nestedsmmus[i]->info)) + continue; + + if (qemuDomainPCIAddressReserveNextAddr(addrs, + def->nestedsmmus[i]->info) < 0) + return -1; + } + } + /* Host PCI devices */ for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevSubsys *subsys = &def->hostdevs[i]->source.subsys; @@ -2286,6 +2414,12 @@ qemuDomainAssignDevicePCISlots(virDomainDef *def, return -1; } + // Route hostdevs to nested SMMUs + if (!addrs->dryRun) { + if (qemuDomainAssignNestedSmmuv3HostdevSlots(def, addrs) < 0) + return -1; + } + /* memballoon. the qemu driver only accepts virtio memballoon devices */ if (virDomainDefHasMemballoon(def) && virDeviceInfoPCIAddressIsWanted(&def->memballoon->info)) { -- 2.34.1

On Wed, Dec 11, 2024 at 04:24:21PM -0800, Nathan Chen via Devel wrote:
Add a pcie-expander-bus controller to the VM definition for each "nestedSmmuv3" device that is generated when the "nestedSmmuv3" IOMMU model is parsed from the VM definition. Assign each "nestedSmmuv3" device to one PXB controller, and route any unmanaged "hostdev" VFIO devices with associated host SMMU nodes to their corresponding PXB controller based on the "name" attributes of "nestedSmmuv3" devices attached to these PXB controllers.
Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_addr.c | 26 ++++++- src/conf/domain_addr.h | 3 +- src/conf/domain_conf.c | 1 + src/qemu/qemu_domain_address.c | 134 +++++++++++++++++++++++++++++++++ 4 files changed, 162 insertions(+), 2 deletions(-)
diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 31004bfc7e..dee198a7d2 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -1627,6 +1627,19 @@ qemuDomainPCIAddressSetCreate(virDomainDef *def,
addrs->dryRun = dryRun;
+ /* PXB indices must come before pcie-root-port indices in qemu, + * so add PXB buses to addrs before the pcie-root-ports. */
Can you elaborate on that requirement - I don't get why QEMU cares which order slots are chosen in for PXB vs PCIe-Root-Port devices.
+ + if (addrs->dryRun) { + for (i = 0; i < def->nnestedsmmus; i++) { + if (!virDeviceInfoPCIAddressIsWanted(def->nestedsmmus[i]->info)) + continue; + if (qemuDomainPCIAddressReserveNextAddr(addrs, + def->nestedsmmus[i]->info) < 0) + return NULL; + } + } +
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Add a pcie-expander-bus controller to the VM definition for each "nestedSmmuv3" device that is generated when the "nestedSmmuv3" IOMMU model is parsed from the VM definition. Assign each "nestedSmmuv3" device to one PXB controller, and route any unmanaged "hostdev" VFIO devices with associated host SMMU nodes to their corresponding PXB controller based on the "name" attributes of "nestedSmmuv3" devices attached to these PXB controllers.
Signed-off-by: Nathan Chen<nathanc@nvidia.com> --- src/conf/domain_addr.c | 26 ++++++- src/conf/domain_addr.h | 3 +- src/conf/domain_conf.c | 1 + src/qemu/qemu_domain_address.c | 134 +++++++++++++++++++++++++++++++++ 4 files changed, 162 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 31004bfc7e..dee198a7d2 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -1627,6 +1627,19 @@ qemuDomainPCIAddressSetCreate(virDomainDef *def,
addrs->dryRun = dryRun;
+ /* PXB indices must come before pcie-root-port indices in qemu, + * so add PXB buses to addrs before the pcie-root-ports. */ Can you elaborate on that requirement - I don't get why QEMU cares which order slots are chosen in for PXB vs PCIe-Root-Port devices.
The libvirt controller index describes the order in which bus controllers are encountered; if the pcie-root-port index is less than the PXB index, the pcie-root-port will not find a PXB to attach to and VM creation will fail. To address this, PXBs must be added to the VM definition first and get assigned a lower index than the pcie-root-ports.

On Thu, Jan 30, 2025 at 10:00:21AM -0800, Nathan Chen wrote:
Add a pcie-expander-bus controller to the VM definition for each "nestedSmmuv3" device that is generated when the "nestedSmmuv3" IOMMU model is parsed from the VM definition. Assign each "nestedSmmuv3" device to one PXB controller, and route any unmanaged "hostdev" VFIO devices with associated host SMMU nodes to their corresponding PXB controller based on the "name" attributes of "nestedSmmuv3" devices attached to these PXB controllers.
Signed-off-by: Nathan Chen<nathanc@nvidia.com> --- src/conf/domain_addr.c | 26 ++++++- src/conf/domain_addr.h | 3 +- src/conf/domain_conf.c | 1 + src/qemu/qemu_domain_address.c | 134 +++++++++++++++++++++++++++++++++ 4 files changed, 162 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index 31004bfc7e..dee198a7d2 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -1627,6 +1627,19 @@ qemuDomainPCIAddressSetCreate(virDomainDef *def, addrs->dryRun = dryRun; + /* PXB indices must come before pcie-root-port indices in qemu, + * so add PXB buses to addrs before the pcie-root-ports. */ Can you elaborate on that requirement - I don't get why QEMU cares which order slots are chosen in for PXB vs PCIe-Root-Port devices.
The libvirt controller index describes the order in which bus controllers are encountered; if the pcie-root-port index is less than the PXB index, the pcie-root-port will not find a PXB to attach to and VM creation will fail. To address this, PXBs must be added to the VM definition first and get assigned a lower index than the pcie-root-ports.
Oh, so you're specifically refering to a case of a pcie-root-port that is attached to a PXB that is listed 2nd in the XML, rather than relative to all possible PXBs. Assigning PXB indexes first fixes the problem in cases where libvirt is the one assigning indexes, but if users manually assign any indexes the problem could still hit. Possibly libvirt needs to be more intelligent about the order in which it specifies the controller devices on the QEMU command line, rather than blindly following XML or index order With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Update the PXB controller busNrs to account for devices we attached in the previous commit, ensuring there are enough VM bus numbers to be assigned for each device attached downstream from each PXB controller. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/qemu/qemu_domain_address.c | 52 ++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/src/qemu/qemu_domain_address.c b/src/qemu/qemu_domain_address.c index dee198a7d2..001d1ec0b9 100644 --- a/src/qemu/qemu_domain_address.c +++ b/src/qemu/qemu_domain_address.c @@ -2678,6 +2678,23 @@ qemuDomainAddressFindNewTargetIndex(virDomainDef *def) } +static int +qemuDomainFindAttachedDevicesBusNr(virDomainDef *def, + int lowestBusNr, + unsigned int contIdx) +{ + size_t i; + for (i = 0; i < def->ncontrollers; i++) { + virDomainControllerDef *cont = def->controllers[i]; + if (cont->info.addr.pci.bus == contIdx && cont->idx != 0) + lowestBusNr = qemuDomainFindAttachedDevicesBusNr(def, lowestBusNr - 1, cont->idx); + if (lowestBusNr <= 2) + return -1; + } + return lowestBusNr; +} + + static int qemuDomainAddressFindNewBusNr(virDomainDef *def) { @@ -2725,8 +2742,43 @@ qemuDomainAddressFindNewBusNr(virDomainDef *def) */ size_t i; + size_t lowestBusNrContIdx = 0; int lowestBusNr = 256; + if (def->iommu && def->iommu->model && + def->iommu->model == VIR_DOMAIN_IOMMU_MODEL_NESTED_SMMUV3) { + for (i = 0; i < def->ncontrollers; i++) { + virDomainControllerDef *cont = def->controllers[i]; + if (cont->type == VIR_DOMAIN_CONTROLLER_TYPE_PCI) { + int thisBusNr = cont->opts.pciopts.busNr; + if (thisBusNr >= 0 && thisBusNr < lowestBusNr) { + lowestBusNr = thisBusNr; + lowestBusNrContIdx = i; + } + } + } + if (lowestBusNr <= 2) + return -1; + if (lowestBusNrContIdx == 0) { + for (i = 0; i < def->ncontrollers; i++) { + if (def->controllers[i]->model == VIR_DOMAIN_CONTROLLER_MODEL_PCIE_EXPANDER_BUS) { + lowestBusNrContIdx = i; + break; + } + } + } else { + for (i = lowestBusNrContIdx + 1; i < def->ncontrollers; i++) { + if (def->controllers[i]->model == VIR_DOMAIN_CONTROLLER_MODEL_PCIE_EXPANDER_BUS) { + lowestBusNrContIdx = i; + break; + } + } + } + lowestBusNr = qemuDomainFindAttachedDevicesBusNr(def, lowestBusNr, + def->controllers[lowestBusNrContIdx]->idx); + return lowestBusNr - 2; + } + for (i = 0; i < def->ncontrollers; i++) { virDomainControllerDef *cont = def->controllers[i]; -- 2.34.1

Implement a sub-test in qemuxmlconftest that mocks the scanning of sysfs to determine how many "nestedSmmuv3" devices to assign to the VM when "nestedSmmuv3" IOMMU model is parsed from the VM definition. Add a VM definition example with an associated qemu command line that exercises the detection of "nestedSmmuv3" IOMMU model and auto-creation of PXB controllers and "nestedSmmuv3" devices. Signed-off-by: Nathan Chen <nathanc@nvidia.com> --- src/conf/domain_conf.c | 2 +- src/conf/domain_conf.h | 2 + src/libvirt_private.syms | 2 + tests/meson.build | 1 + .../iommu-nestedsmmuv3.aarch64-latest.args | 38 ++++++++++++ .../iommu-nestedsmmuv3.aarch64-latest.xml | 61 +++++++++++++++++++ tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml | 29 +++++++++ tests/qemuxmlconftest.c | 4 +- tests/virnestedsmmuv3mock.c | 57 +++++++++++++++++ 9 files changed, 194 insertions(+), 2 deletions(-) create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml create mode 100644 tests/virnestedsmmuv3mock.c diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 46f9b9b0cf..d8b473854e 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -22491,7 +22491,7 @@ virDomainDefAddImplicitVideo(virDomainDef *def, virDomainXMLOption *xmlopt) return 0; } -static int +int virDomainDefAddImplicitNestedSmmuv3(virDomainDef *def) { // Get the number of host-level SMMUv3 instances diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index edde9b63d6..8e7e8166e3 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -3873,6 +3873,8 @@ bool virDomainDefCheckABIStabilityFlags(virDomainDef *src, int virDomainDefAddImplicitDevices(virDomainDef *def, virDomainXMLOption *xmlopt); +int virDomainDefAddImplicitNestedSmmuv3(virDomainDef *def) G_NO_INLINE; + virDomainIOThreadIDDef *virDomainIOThreadIDFind(const virDomainDef *def, unsigned int iothread_id); virDomainIOThreadIDDef *virDomainIOThreadIDAdd(virDomainDef *def, diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index c931003fad..2527f0844a 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -312,6 +312,7 @@ virDomainCryptoTypeTypeFromString; virDomainCryptoTypeTypeToString; virDomainDefAddController; virDomainDefAddImplicitDevices; +virDomainDefAddImplicitNestedSmmuv3; virDomainDefAddUSBController; virDomainDefCheckABIStability; virDomainDefCheckABIStabilityFlags; @@ -531,6 +532,7 @@ virDomainMemorySourceTypeFromString; virDomainMemorySourceTypeToString; virDomainMouseModeTypeFromString; virDomainMouseModeTypeToString; +virDomainNestedSmmuv3DefNew; virDomainNetAllocateActualDevice; virDomainNetAppendIPAddress; virDomainNetARPInterfaces; diff --git a/tests/meson.build b/tests/meson.build index 0d76d37959..d3b9d4612b 100644 --- a/tests/meson.build +++ b/tests/meson.build @@ -177,6 +177,7 @@ if conf.has('WITH_QEMU') { 'name': 'qemuhotplugmock', 'link_with': [ test_qemu_driver_lib, test_utils_qemu_lib, test_utils_lib ] }, { 'name': 'qemuxml2argvmock' }, { 'name': 'virhostidmock' }, + { 'name': 'virnestedsmmuv3mock' }, ] else test_qemu_driver_lib = [] diff --git a/tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args b/tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args new file mode 100644 index 0000000000..bf094dd20a --- /dev/null +++ b/tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args @@ -0,0 +1,38 @@ +LC_ALL=C \ +PATH=/bin \ +HOME=/var/lib/libvirt/qemu/domain--1-guest \ +USER=test \ +LOGNAME=test \ +XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-guest/.local/share \ +XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-guest/.cache \ +XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-guest/.config \ +/usr/bin/qemu-system-aarch64 \ +-name guest=guest,debug-threads=on \ +-S \ +-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain--1-guest/master-key.aes"}' \ +-machine virt,usb=off,gic-version=2,dump-guest-core=off,memory-backend=mach-virt.ram,acpi=off \ +-accel tcg \ +-cpu cortex-a15 \ +-m size=1048576k \ +-object '{"qom-type":"memory-backend-ram","id":"mach-virt.ram","size":1073741824}' \ +-overcommit mem-lock=off \ +-smp 1,sockets=1,cores=1,threads=1 \ +-uuid 1ccfd97d-5eb4-478a-bbe6-88d254c16db7 \ +-display none \ +-no-user-config \ +-nodefaults \ +-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \ +-mon chardev=charmonitor,id=monitor,mode=control \ +-rtc base=utc \ +-no-shutdown \ +-boot strict=on \ +-device '{"driver":"pxb-pcie","bus_nr":254,"id":"pci.1","bus":"pcie.0","addr":"0x1"}' \ +-device '{"driver":"pxb-pcie","bus_nr":252,"id":"pci.2","bus":"pcie.0","addr":"0x2"}' \ +-device '{"driver":"pxb-pcie","bus_nr":250,"id":"pci.3","bus":"pcie.0","addr":"0x3"}' \ +-device '{"driver":"pcie-root-port","port":32,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x4"}' \ +-audiodev '{"id":"audio1","driver":"none"}' \ +-device '{"driver":"arm-smmuv3-nested","bus":"pci.1"}' \ +-device '{"driver":"arm-smmuv3-nested","bus":"pci.2"}' \ +-device '{"driver":"arm-smmuv3-nested","bus":"pci.3"}' \ +-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ +-msg timestamp=on diff --git a/tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml b/tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml new file mode 100644 index 0000000000..0aa9296eaa --- /dev/null +++ b/tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml @@ -0,0 +1,61 @@ +<domain type='qemu'> + <name>guest</name> + <uuid>1ccfd97d-5eb4-478a-bbe6-88d254c16db7</uuid> + <memory unit='KiB'>1048576</memory> + <currentMemory unit='KiB'>1048576</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='aarch64' machine='virt'>hvm</type> + <boot dev='hd'/> + </os> + <features> + <gic version='2'/> + </features> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>cortex-a15</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-aarch64</emulator> + <controller type='usb' index='0' model='none'/> + <controller type='pci' index='0' model='pcie-root'/> + <controller type='pci' index='1' model='pcie-expander-bus'> + <model name='pxb-pcie'/> + <target busNr='254'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> + </controller> + <controller type='pci' index='2' model='pcie-expander-bus'> + <model name='pxb-pcie'/> + <target busNr='252'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> + </controller> + <controller type='pci' index='3' model='pcie-expander-bus'> + <model name='pxb-pcie'/> + <target busNr='250'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> + </controller> + <controller type='pci' index='4' model='pcie-root-port'> + <model name='pcie-root-port'/> + <target chassis='4' port='0x20'/> + <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> + </controller> + <audio id='1' type='none'/> + <memballoon model='none'/> + <nestedSmmuv3> + <name>smmu3.0x0000000000000001</name> + <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> + </nestedSmmuv3> + <nestedSmmuv3> + <name>smmu3.0x0000000000000002</name> + <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> + </nestedSmmuv3> + <nestedSmmuv3> + <name>smmu3.0x0000000000000003</name> + <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/> + </nestedSmmuv3> + <iommu model='nestedSmmuv3'/> + </devices> +</domain> diff --git a/tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml b/tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml new file mode 100644 index 0000000000..c114a3675c --- /dev/null +++ b/tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml @@ -0,0 +1,29 @@ +<domain type='qemu'> + <name>guest</name> + <uuid>1ccfd97d-5eb4-478a-bbe6-88d254c16db7</uuid> + <memory unit='KiB'>1048576</memory> + <currentMemory unit='KiB'>1048576</currentMemory> + <vcpu placement='static'>1</vcpu> + <os> + <type arch='aarch64' machine='virt'>hvm</type> + <boot dev='hd'/> + </os> + <features> + <gic version='2'/> + </features> + <cpu mode='custom' match='exact' check='none'> + <model fallback='forbid'>cortex-a15</model> + </cpu> + <clock offset='utc'/> + <on_poweroff>destroy</on_poweroff> + <on_reboot>restart</on_reboot> + <on_crash>destroy</on_crash> + <devices> + <emulator>/usr/bin/qemu-system-aarch64</emulator> + <controller type='usb' index='0' model='none'/> + <controller type='pci' index='0' model='pcie-root'/> + <audio id='1' type='none'/> + <memballoon model='none'/> + <iommu model='nestedSmmuv3'/> + </devices> +</domain> diff --git a/tests/qemuxmlconftest.c b/tests/qemuxmlconftest.c index f3c8d0ae34..01176f42b7 100644 --- a/tests/qemuxmlconftest.c +++ b/tests/qemuxmlconftest.c @@ -2772,6 +2772,7 @@ mymain(void) DO_TEST_CAPS_LATEST_PARSE_ERROR("intel-iommu-wrong-machine"); DO_TEST_CAPS_LATEST_ABI_UPDATE("intel-iommu-eim-autoadd"); DO_TEST_CAPS_ARCH_LATEST("iommu-smmuv3", "aarch64"); + DO_TEST_CAPS_ARCH_LATEST("iommu-nestedsmmuv3", "aarch64"); DO_TEST_CAPS_LATEST("virtio-iommu-x86_64"); DO_TEST_CAPS_VER_PARSE_ERROR("virtio-iommu-x86_64", "6.1.0"); DO_TEST_CAPS_ARCH_LATEST("virtio-iommu-aarch64", "aarch64"); @@ -3030,7 +3031,8 @@ VIR_TEST_MAIN_PRELOAD(mymain, VIR_TEST_MOCK("virrandom"), VIR_TEST_MOCK("qemucpu"), VIR_TEST_MOCK("virpci"), - VIR_TEST_MOCK("virnuma")) + VIR_TEST_MOCK("virnuma"), + VIR_TEST_MOCK("virnestedsmmuv3")) #else diff --git a/tests/virnestedsmmuv3mock.c b/tests/virnestedsmmuv3mock.c new file mode 100644 index 0000000000..1f40ef46f4 --- /dev/null +++ b/tests/virnestedsmmuv3mock.c @@ -0,0 +1,57 @@ +/* + * virnestedsmmuv3mock.c: Mock sysfs nested SMMU paths + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library. If not, see + * <http://www.gnu.org/licenses/>. + */ + +#include <config.h> + +#include "conf/domain_conf.h" +#include "viralloc.h" +#include "virstring.h" +#include "internal.h" +#include "virmock.h" + +int +virDomainDefAddImplicitNestedSmmuv3(virDomainDef *def) +{ + virDomainNestedSmmuv3Def* nestedsmmuv3 = NULL; + + if (!(nestedsmmuv3 = virDomainNestedSmmuv3DefNew())) + return -1; + VIR_REALLOC_N(nestedsmmuv3->name, strlen("smmu3.0x0000000000000001") + 1); + virStrcpy(nestedsmmuv3->name, "smmu3.0x0000000000000001", + strlen("smmu3.0x0000000000000001") + 1); + VIR_REALLOC_N(def->nestedsmmus, def->nnestedsmmus + 1); + def->nestedsmmus[def->nnestedsmmus++] = nestedsmmuv3; + + if (!(nestedsmmuv3 = virDomainNestedSmmuv3DefNew())) + return -1; + VIR_REALLOC_N(nestedsmmuv3->name, strlen("smmu3.0x0000000000000002") + 1); + virStrcpy(nestedsmmuv3->name, "smmu3.0x0000000000000002", + strlen("smmu3.0x0000000000000002") + 1); + VIR_REALLOC_N(def->nestedsmmus, def->nnestedsmmus + 1); + def->nestedsmmus[def->nnestedsmmus++] = nestedsmmuv3; + + if (!(nestedsmmuv3 = virDomainNestedSmmuv3DefNew())) + return -1; + VIR_REALLOC_N(nestedsmmuv3->name, strlen("smmu3.0x0000000000000003") + 1); + virStrcpy(nestedsmmuv3->name, "smmu3.0x0000000000000003", + strlen("smmu3.0x0000000000000003") + 1); + VIR_REALLOC_N(def->nestedsmmus, def->nnestedsmmus + 1); + def->nestedsmmus[def->nnestedsmmus++] = nestedsmmuv3; + + return 0; +} -- 2.34.1

On Wed, Dec 11, 2024 at 04:24:18PM -0800, Nathan Chen via Devel wrote:
Hi,
This is a draft solution for supporting multiple vSMMU instances in a qemu VM.
Based on discussions/suggestions received for a previous RFC by Nicolin here[0], the association of vSMMUs to VFIO devices in VM PCIe topology should be moved out of qemu into libvirt. In addition, the nested SMMU nodes should be passed to qemu as pluggable devices.
To address these changes, this patch series introduces a new "nestedSmmuv3" IOMMU model and "nestedSmmuv3" device type. Upon specifying the nestedSmmuv3 IOMMU model, nestedSmmuv3 devices will be auto-added to the VM definition based on the available SMMU nodes in the host's sysfs. The nestedSmmuv3 devices will each be attached to a separate PXB controller, and VFIO devices will be routed to PXBs based on their association with host SMMU nodes. This will maintain a VM PCIe topology that allows for multiple nested SMMUs per Nicolin's original qemu patch series in [0] and Shameer's work in [1] to remove VM topology changes from qemu and allow the nested SMMUs to be specified as pluggable devices.
For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev for passthrough:
<devices> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> </hostdev> <iommu model='nestedSmmuv3'/> </devices>
Libvirt will scan sysfs and populate the VM definition with controllers and nestedSmmuv3 devices based on host config. So if /sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU node represented by /sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000 and there are 3 host SMMU nodes under /sys/class/iommu/, we'll see three auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus controller. Then the hostdev will be routed to a PXB controller that has a matching host SMMU node associated with it:
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> <controller type='pci' index='2' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='251'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> <controller type='pci' index='3' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='249'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </controller> <controller type='pci' index='4' model='pcie-root-port'> <model name='pcie-root-port'/> <target chassis='7' port='0x8'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/> </controller> <hostdev mode='subsystem' type='pci' managed='no'> <source> <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/> </hostdev> <iommu model='nestedSmmuv3'/> <nestedSmmuv3> <name>smmu3.0x0000000012000000</name> <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/> </nestedSmmuv3> <nestedSmmuv3> <name>smmu3.0x0000000016000000</name> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </nestedSmmuv3> <nestedSmmuv3> <name>smmu3.0x0000000011000000</name> <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/> </nestedSmmuv3> <iommu model='nestedSmmuv3'/> </devices>
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element.
Hi Daniel, Thanks for the feedback - I will remove the <nestedSmmuv3> device type in the next iteration. I had implemented it in order to associate PXB controllers with host SMMU nodes, but instead I will keep track of host SMMU node names in a new PXB controller attribute. Best, Nathan

Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element.
Would keeping track of PXB <=> host SMMU nodes be better represented with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model is specified? This method would be simplest IMO because we could omit keeping track of the nestedSmmuv3 bus number in the virDomainDef struct since its association with a PXB controller would already be baked-in. <devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> ... <iommu model='nestedSmmuv3'/> </devices> Or would it still be preferred to purely contain the host SMMU node names and bus numbers within the <iommu> element? If so, the virDomainDef struct is setup to only have a single virDomainIOMMUDef member, but we need to keep track of multiple host SMMU node names and bus numbers. Would we setup multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t *nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of these multiple host SMMU node names and bus numbers? Or we could modify the device info member of virDomainIOMMUDef to be a variable-length array of device info structs instead of the "size_t *nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest approach, and the qemu command line doesn't specify a slot and function to arm-smmuv3-nested devices - it just specifies bus number to keep track of which PXB is associated with each SMMU node. If we need a way to associate PXB to SMMU node, we have multiple possible approaches, listed below in order of best to worst in my opinion: 1. Adding a <nestedSmmuv3> attribute for PXB controller. 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding variable-length array members to virDomainIOMMUDef for multiple SMMU node names and bus numbers. 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a variable-length array member to virDomainIOMMUDef for SMMU node names. Changing the single virDomainDeviceInfo struct to a variable-length array of virDomainDeviceInfo structs for multiple nested SMMU bus numbers. 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef. I would appreciate your thoughts on which method to go with. Thanks, Nathan

On Sun, Dec 15, 2024 at 11:45:56AM -0800, Nathan Chen wrote:
Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element.
Would keeping track of PXB <=> host SMMU nodes be better represented with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model is specified? This method would be simplest IMO because we could omit keeping track of the nestedSmmuv3 bus number in the virDomainDef struct since its association with a PXB controller would already be baked-in.
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> ... <iommu model='nestedSmmuv3'/> </devices>
Or would it still be preferred to purely contain the host SMMU node names and bus numbers within the <iommu> element? If so, the virDomainDef struct is setup to only have a single virDomainIOMMUDef member, but we need to keep track of multiple host SMMU node names and bus numbers. Would we setup multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t *nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of these multiple host SMMU node names and bus numbers?
Or we could modify the device info member of virDomainIOMMUDef to be a variable-length array of device info structs instead of the "size_t *nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest approach, and the qemu command line doesn't specify a slot and function to arm-smmuv3-nested devices - it just specifies bus number to keep track of which PXB is associated with each SMMU node.
If we need a way to associate PXB to SMMU node, we have multiple possible approaches, listed below in order of best to worst in my opinion:
1. Adding a <nestedSmmuv3> attribute for PXB controller. 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding variable-length array members to virDomainIOMMUDef for multiple SMMU node names and bus numbers. 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a variable-length array member to virDomainIOMMUDef for SMMU node names. Changing the single virDomainDeviceInfo struct to a variable-length array of virDomainDeviceInfo structs for multiple nested SMMU bus numbers. 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.
I would appreciate your thoughts on which method to go with.
I'm finding it a little hard to give a recommendation, as I'm not confident I fully understand the relationship between all the pieces. Ignoring the specific QEMU impl, can you give a general outline of the relationship between host SMMU(s), guest SMMU(s) and PCI controllers, especially the M:N values of the relations. Also, if this is getting associated with the host SMMU in some way, does this have implications for live migration compatibility ? With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element.
Would keeping track of PXB <=> host SMMU nodes be better represented with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model is specified? This method would be simplest IMO because we could omit keeping track of the nestedSmmuv3 bus number in the virDomainDef struct since its association with a PXB controller would already be baked-in.
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> ... <iommu model='nestedSmmuv3'/> </devices>
Or would it still be preferred to purely contain the host SMMU node names and bus numbers within the <iommu> element? If so, the virDomainDef struct is setup to only have a single virDomainIOMMUDef member, but we need to keep track of multiple host SMMU node names and bus numbers. Would we setup multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t *nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of these multiple host SMMU node names and bus numbers?
Or we could modify the device info member of virDomainIOMMUDef to be a variable-length array of device info structs instead of the "size_t *nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest approach, and the qemu command line doesn't specify a slot and function to arm-smmuv3-nested devices - it just specifies bus number to keep track of which PXB is associated with each SMMU node.
If we need a way to associate PXB to SMMU node, we have multiple possible approaches, listed below in order of best to worst in my opinion:
1. Adding a <nestedSmmuv3> attribute for PXB controller. 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding variable-length array members to virDomainIOMMUDef for multiple SMMU node names and bus numbers. 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a variable-length array member to virDomainIOMMUDef for SMMU node names. Changing the single virDomainDeviceInfo struct to a variable-length array of virDomainDeviceInfo structs for multiple nested SMMU bus numbers. 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.
I would appreciate your thoughts on which method to go with.
Ignoring the specific QEMU impl, can you give a general outline of the relationship between host SMMU(s), guest SMMU(s) and PCI controllers, especially the M:N values of the relations.
Guest SMMU (vSMMU) allows stage 1 translation use cases for the VM and VM-assigned devices. For the multiple vSMMU design, a 1:1 ratio between host SMMUs and vSMMUs is implemented where commands issued to each vSMMU are translated by the hypervisor and forwarded to the corresponding host SMMU. The 1:1 ratio between vSMMUs and guest PCI controllers that directly plug into the PCIe root controller (i.e. PXBs) is for the specific QEMU implementation to make it easier to establish the mapping of which passthrough device to which SMMU in the ACPI I/O remapping table while avoiding associating emulated devices to vSMMUs [0].
Also, if this is getting associated with the host SMMU in some way, does this have implications for live migration compatibility ?
Live migration of devices associated with vIOMMU is currently blocked, per [1] and [2] below. Joao is working on it here [3]. [0] https://lore.kernel.org/qemu-devel/bc7a57311ac4976699789ceca329edfdfe823c2d.... [1] https://github.com/qemu/qemu/commit/3c26c80a0a269ce7870d1475e756607e939226cd [2] https://lore.kernel.org/qemu-devel/3fa6f093ff9a4749bcd25d0dfa60b1d7@huawei.c... [3] https://lore.kernel.org/all/20230622214845.3980-1-joao.m.martins@oracle.com/ Thanks, Nathan

Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element.
1. Adding a <nestedSmmuv3> attribute for PXB controller. 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding variable-length array members to virDomainIOMMUDef for multiple SMMU node names and bus numbers. 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a variable-length array member to virDomainIOMMUDef for SMMU node names. Changing the single virDomainDeviceInfo struct to a variable-length array of virDomainDeviceInfo structs for multiple nested SMMU bus numbers. 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.
Following up here - would you have a recommendation on which implementation to proceed with? I will proceed with adding a <nestedSmmuv3> attribute for PXB controller and removing the <nestedSmmuv3> device stanzas for the next RFC, if there is no preference for one method over the other. Thanks, Nathan

On Sun, Dec 15, 2024 at 11:45:56AM -0800, Nathan Chen wrote:
Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element.
Would keeping track of PXB <=> host SMMU nodes be better represented with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model is specified? This method would be simplest IMO because we could omit keeping track of the nestedSmmuv3 bus number in the virDomainDef struct since its association with a PXB controller would already be baked-in.
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> ... <iommu model='nestedSmmuv3'/> </devices>
This doesn't make sense, as there's no association defined between the PXB <controller> and the <iommu> device. I would expect th <iommu> to have a 'bus' attribute referring to the <controller> 'index' attribute value. The '<nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>' bit is also wierd, because it never appears in the QEMU command line at all. Some kind of implicit association seems to be taking place behind the scenes and that is unlikely to be a desirable thing. We need to model associations between devices explicitly and directly pass this on to QEMU. If the nestedSmmuv3 is assocaited with a host SMMUv3 device, then that association should be represented on the <iommu> device, and this mapping passed to QEMU.
Or would it still be preferred to purely contain the host SMMU node names and bus numbers within the <iommu> element? If so, the virDomainDef struct is setup to only have a single virDomainIOMMUDef member, but we need to keep track of multiple host SMMU node names and bus numbers. Would we setup multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t *nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of these multiple host SMMU node names and bus numbers?
Or we could modify the device info member of virDomainIOMMUDef to be a variable-length array of device info structs instead of the "size_t *nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest approach, and the qemu command line doesn't specify a slot and function to arm-smmuv3-nested devices - it just specifies bus number to keep track of which PXB is associated with each SMMU node.
If we need a way to associate PXB to SMMU node, we have multiple possible approaches, listed below in order of best to worst in my opinion:
1. Adding a <nestedSmmuv3> attribute for PXB controller. 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding variable-length array members to virDomainIOMMUDef for multiple SMMU node names and bus numbers. 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a variable-length array member to virDomainIOMMUDef for SMMU node names. Changing the single virDomainDeviceInfo struct to a variable-length array of virDomainDeviceInfo structs for multiple nested SMMU bus numbers. 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.
We need (4). Each device must be represented as a distinct virDomainIOMMUDef, and then associated to the controller and host SMMU. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element. Would keeping track of PXB <=> host SMMU nodes be better represented with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model is specified? This method would be simplest IMO because we could omit keeping track of the nestedSmmuv3 bus number in the virDomainDef struct since its association with a PXB controller would already be baked-in.
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> ... <iommu model='nestedSmmuv3'/> </devices>
The '<nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>' bit is also wierd, because it never appears in the QEMU command line at all. Some kind of implicit association seems to be taking place behind the scenes and that is unlikely to be a desirable thing. We need to model associations between devices explicitly and directly pass this on to QEMU.
If the nestedSmmuv3 is assocaited with a host SMMUv3 device, then that association should be represented on the <iommu> device, and this mapping passed to QEMU.
The '<nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>' bit will be moved to the <iommu> definition in the next revision. I agree that explicitly specifying the host SMMU node name in QEMU would make the host to guest SMMU association more clear. But the current QEMU implementation appears more flexible in allowing us to create nestedSmmuv3 instances and then imply the association to a host SMMU node based on the PCIe topology isolating devices under different PXBs. I think we should discuss this further on the QEMU thread with Shameer before he sends out the second revision of his series, then adjust the libvirt implementation based on the next QEMU revision.
Or would it still be preferred to purely contain the host SMMU node names and bus numbers within the <iommu> element? If so, the virDomainDef struct is setup to only have a single virDomainIOMMUDef member, but we need to keep track of multiple host SMMU node names and bus numbers. Would we setup multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t *nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of these multiple host SMMU node names and bus numbers?
Or we could modify the device info member of virDomainIOMMUDef to be a variable-length array of device info structs instead of the "size_t *nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest approach, and the qemu command line doesn't specify a slot and function to arm-smmuv3-nested devices - it just specifies bus number to keep track of which PXB is associated with each SMMU node.
If we need a way to associate PXB to SMMU node, we have multiple possible approaches, listed below in order of best to worst in my opinion:
1. Adding a <nestedSmmuv3> attribute for PXB controller. 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding variable-length array members to virDomainIOMMUDef for multiple SMMU node names and bus numbers. 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a variable-length array member to virDomainIOMMUDef for SMMU node names. Changing the single virDomainDeviceInfo struct to a variable-length array of virDomainDeviceInfo structs for multiple nested SMMU bus numbers. 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef. We need (4). Each device must be represented as a distinct virDomainIOMMUDef, and then associated to the controller and host SMMU.
Makes sense, I will proceed with this in the next revision. Thanks!

On Thu, Jan 30, 2025 at 09:51:55AM -0800, Nathan Chen wrote:
Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element. Would keeping track of PXB <=> host SMMU nodes be better represented with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model is specified? This method would be simplest IMO because we could omit keeping track of the nestedSmmuv3 bus number in the virDomainDef struct since its association with a PXB controller would already be baked-in.
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> ... <iommu model='nestedSmmuv3'/> </devices>
The '<nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>' bit is also wierd, because it never appears in the QEMU command line at all. Some kind of implicit association seems to be taking place behind the scenes and that is unlikely to be a desirable thing. We need to model associations between devices explicitly and directly pass this on to QEMU.
If the nestedSmmuv3 is assocaited with a host SMMUv3 device, then that association should be represented on the <iommu> device, and this mapping passed to QEMU.
The '<nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>' bit will be moved to the <iommu> definition in the next revision.
I agree that explicitly specifying the host SMMU node name in QEMU would make the host to guest SMMU association more clear. But the current QEMU implementation appears more flexible in allowing us to create nestedSmmuv3 instances and then imply the association to a host SMMU node based on the PCIe topology isolating devices under different PXBs. I think we should discuss this further on the QEMU thread with Shameer before he sends out the second revision of his series, then adjust the libvirt implementation based on the next QEMU revision.
"auto-magic" associations between different pieces has too large a scope to bite us at a later date. libvirt's goal has always been to make everything that's functionally impacting a guest device be 100% explicit. So I don't think we should be implying mappings to the host SMMU in QEMU at all, QEMU must be told what to map to. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

-----Original Message----- From: Nathan Chen <nathanc@nvidia.com> Sent: Thursday, January 30, 2025 5:52 PM To: Daniel P. Berrangé <berrange@redhat.com> Cc: devel@lists.libvirt.org; Nicolin Chen <nicolinc@nvidia.com>; Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Subject: Re: [RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs
Hi Daniel,
Top level libvirt device representation in XML is based on the device *class*, not the specific device impl. Adding a <nestedSmmuv3> device type XML element in libvirt is totally inappropriate. Any configuration must be done beneath the <iommu> element. Would keeping track of PXB <=> host SMMU nodes be better represented with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model is specified? This method would be simplest IMO because we could omit keeping track of the nestedSmmuv3 bus number in the virDomainDef struct since its association with a PXB controller would already be baked-in.
<devices> ... <controller type='pci' index='1' model='pcie-expander-bus'> <model name='pxb-pcie'/> <target busNr='254'/> <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </controller> ... <iommu model='nestedSmmuv3'/> </devices>
The '<nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>' bit is also wierd, because it never appears in the QEMU command line at all. Some kind of implicit association seems to be taking place behind the scenes and that is unlikely to be a desirable thing. We need to model associations between devices explicitly and directly pass this on to QEMU.
If the nestedSmmuv3 is assocaited with a host SMMUv3 device, then that association should be represented on the <iommu> device, and this mapping passed to QEMU.
The '<nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>' bit will be moved to the <iommu> definition in the next revision.
I agree that explicitly specifying the host SMMU node name in QEMU would make the host to guest SMMU association more clear. But the current QEMU implementation appears more flexible in allowing us to create nestedSmmuv3 instances and then imply the association to a host SMMU node based on the PCIe topology isolating devices under different PXBs. I think we should discuss this further on the QEMU thread with Shameer before he sends out the second revision of his series, then adjust the libvirt implementation based on the next QEMU revision.
I just replied to Daniel's same comment on that thread .Sorry forgot to CC you. It's here, https://lore.kernel.org/qemu-devel/7ecabe74e0514367baf28d67675e5db8@huawei.c... Thanks, Shameer.
participants (4)
-
Daniel P. Berrangé
-
Nathan Chen
-
nathanc@nvidia.com
-
Shameerali Kolothum Thodi