Hi,

This is a draft solution for supporting multiple vSMMU instances in a qemu VM.

Based on the discussion and suggestions on Nicolin's previous RFC [0], the
association of vSMMUs with VFIO devices in the VM PCIe topology should move
out of qemu and into libvirt. In addition, the nested SMMU nodes should be
passed to qemu as pluggable devices.

To address these changes, this patch series introduces a new "nestedSmmuv3"
IOMMU model and a "nestedSmmuv3" device type. When the nestedSmmuv3 IOMMU
model is specified, nestedSmmuv3 devices are auto-added to the VM definition
based on the SMMU nodes available in the host's sysfs. Each nestedSmmuv3
device is attached to a separate PXB controller, and VFIO devices are routed
to PXBs based on their association with host SMMU nodes. The resulting VM PCIe
topology allows for multiple nested SMMUs, as in Nicolin's original qemu patch
series [0], and follows Shameer's work in [1], which removes VM topology
changes from qemu and allows the nested SMMUs to be specified as pluggable
devices.

For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev for
passthrough:

  <devices>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <iommu model='nestedSmmuv3'/>
  </devices>

Libvirt will scan sysfs and populate the VM definition with controllers and
nestedSmmuv3 devices based on the host configuration. So if
/sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU node
represented by
/sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000
and there are three host SMMU nodes under /sys/class/iommu/, we'll see three
auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus controller.
The hostdev will then be routed to the PXB controller associated with the
matching host SMMU node:

  <devices>
    ...
    <controller type='pci' index='1' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='254'/>
      <address type='pci' domain='0x0000' bus='0x00'
               slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='251'/>
      <address type='pci' domain='0x0000' bus='0x00'
               slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='3' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='249'/>
      <address type='pci' domain='0x0000' bus='0x00'
               slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x02'
               slot='0x01' function='0x0'/>
    </controller>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04'
               slot='0x00' function='0x0'/>
    </hostdev>
    <iommu model='nestedSmmuv3'/>
    <nestedSmmuv3>
      <name>smmu3.0x0000000012000000</name>
      <address type='pci' domain='0x0000' bus='0x01'
               slot='0x00' function='0x0'/>
    </nestedSmmuv3>
    <nestedSmmuv3>
      <name>smmu3.0x0000000016000000</name>
      <address type='pci' domain='0x0000' bus='0x02'
               slot='0x00' function='0x0'/>
    </nestedSmmuv3>
    <nestedSmmuv3>
      <name>smmu3.0x0000000011000000</name>
      <address type='pci' domain='0x0000' bus='0x03'
               slot='0x00' function='0x0'/>
    </nestedSmmuv3>
  </devices>

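For readers who want to check the host side by hand, the sysfs lookup
described above can be sketched roughly as follows. This is only an
illustration in Python, not the code in this series (which is C inside
libvirt): the function names and the sysfs_root parameter are hypothetical,
and it assumes the last path component of the resolved "iommu" symlink is the
SMMU node name, as in the example paths above.

```python
import os


def host_smmu_nodes(sysfs_root="/sys"):
    """Return the sorted node names listed under <sysfs_root>/class/iommu."""
    iommu_dir = os.path.join(sysfs_root, "class", "iommu")
    if not os.path.isdir(iommu_dir):
        return []
    return sorted(os.listdir(iommu_dir))


def smmu_node_for_device(pci_addr, sysfs_root="/sys"):
    """Return the host SMMU node name behind a PCI device, or None.

    pci_addr is a full address such as "0009:01:00.0".
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "iommu")
    if not os.path.exists(link):
        return None
    # Assumption: the symlink target ends in the node name,
    # e.g. ".../iommu/smmu3.0x0000000016000000".
    return os.path.basename(os.path.realpath(link))


def group_by_smmu(pci_addrs, sysfs_root="/sys"):
    """Map each host SMMU node name to the passthrough devices behind it."""
    groups = {node: [] for node in host_smmu_nodes(sysfs_root)}
    for addr in pci_addrs:
        node = smmu_node_for_device(addr, sysfs_root)
        if node is not None:
            groups.setdefault(node, []).append(addr)
    return groups
```

Calling group_by_smmu(["0009:01:00.0"]) on the host gives one entry per SMMU
node, which mirrors the one-PXB-per-node layout in the XML above; nodes with
no passthrough device map to an empty list.
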
TODO:
- No DMA mapping can be found by UEFI when multiple passthrough devices are
  specified in the VM definition, and VM boot is subsequently blocked. We
  don't encounter this issue when passing through a single device. We need to
  investigate this for the next revision, and we'll include iommufd support in
  that revision to narrow down whether the required fix is outside of libvirt.
- Shameer's qemu branch specifies the nestedSmmuv3 bus with "pci-bus" instead
  of "bus", so the libvirt compilation test args and the qemu args in
  qemuBuildPCINestedSmmuv3DevProps() need to be modified to match this
  revision of qemu. This will be reverted to "bus" in the next qemu revision.
- This patchset decrements the PXB busNr based on how many devices are
  attached downstream. The libvirt documentation states that we must reserve a
  busNr for the PXB itself in addition to any devices attached downstream.
  When I launch a VM in which a PXB has a pcie-root-port and a hostdev
  attached downstream, busNrs 253, 252, and 251 are reserved. But the PXB
  itself already has a bus number assigned via its <address/> element, and in
  the VM I see 253 and 252 assigned to the hostdev and pcie-root-port, but not
  251. Should we decrement busNr as the libvirt documentation describes, or do
  we only need the two busNrs 253 and 252 in this example?

This series is on GitHub:
https://github.com/NathanChenNVIDIA/libvirt/tree/nested-smmuv3-12-05-24

Thanks,
Nathan

[0] https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@nvidia.com/
[1] https://lore.kernel.org/qemu-devel/20241108125242.60136-1-shameerali.kolo...

Signed-off-by: Nathan Chen <nathanc(a)nvidia.com>

Nathan Chen (5):
  conf: Add a nestedSmmuv3 IOMMU model
  qemu: Implement and auto-add a nestedSmmuv3 device type
  qemu: Create PXBs and auto-assign VFIO devs and nested SMMUs
  qemu: Update PXB busNr for nestedSmmuv3 controllers
  qemu: Add test case for specifying multiple nested SMMUs

docs/formatdomain.rst | 25 ++-
src/ch/ch_domain.c | 1 +
src/conf/domain_addr.c | 26 ++-
src/conf/domain_addr.h | 4 +-
src/conf/domain_conf.c | 188 +++++++++++++++++
src/conf/domain_conf.h | 15 ++
src/conf/domain_postparse.c | 1 +
src/conf/domain_validate.c | 24 +++
src/conf/schemas/domaincommon.rng | 17 ++
src/conf/virconftypes.h | 2 +
src/libvirt_private.syms | 2 +
src/lxc/lxc_driver.c | 6 +
src/qemu/qemu_command.c | 64 +++++-
src/qemu/qemu_command.h | 4 +
src/qemu/qemu_domain.c | 2 +
src/qemu/qemu_domain_address.c | 193 ++++++++++++++++++
src/qemu/qemu_driver.c | 3 +
src/qemu/qemu_hotplug.c | 5 +
src/qemu/qemu_postparse.c | 1 +
src/qemu/qemu_validate.c | 16 ++
src/test/test_driver.c | 4 +
tests/meson.build | 1 +
.../iommu-nestedsmmuv3.aarch64-latest.args | 38 ++++
.../iommu-nestedsmmuv3.aarch64-latest.xml | 61 ++++++
tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml | 29 +++
tests/qemuxmlconftest.c | 4 +-
tests/schemas/device.rng.in | 1 +
tests/virnestedsmmuv3mock.c | 57 ++++++
28 files changed, 788 insertions(+), 6 deletions(-)
create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args
create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml
create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml
create mode 100644 tests/virnestedsmmuv3mock.c
--
2.34.1