Re: [RFC 00/18] vfio: Adopt iommufd

[Cc +libvirt folks]

On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote:
With the introduction of iommufd[1], the linux kernel provides a generic interface for userspace drivers to propagate their DMA mappings to the kernel for assigned devices. This series ports the VFIO devices onto the /dev/iommu uapi and lets it coexist with the legacy implementation. Other devices like vdpa, vfio mdev, etc. are not considered yet.
For vfio devices, the new interface is tied to the device fd and iommufd, as the iommufd solution is device-centric. This is different from legacy vfio, which is group-centric. To support both interfaces in QEMU, this series introduces the iommu backend concept in the form of different container classes. The existing vfio container is named the legacy container (equivalent to the legacy iommu backend in this series), while the new iommufd based container is named the iommufd container (also referred to as the iommufd backend in this series). The two backend types have their own way to set up a secure context and a DMA management interface. The diagram below shows how it looks with both BEs.
                     VFIO                           AddressSpace/Memory
    +-------+  +----------+  +-----+  +-----+
    |  pci  |  | platform |  |  ap |  | ccw |
    +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
        |           |           |        |        |   AddressSpace       |
        |           |           |        |        +------------+---------+
    +---V-----------V-----------V--------V----+               /
    |             VFIOAddressSpace            | <------------+
    |                     |                   |  MemoryListener
    |            VFIOContainer list           |
    +-------+----------------------------+----+
            |                            |
            |                            |
    +-------V------+          +----------V--------+
    |   iommufd    |          |    vfio legacy    |
    |  container   |          |     container     |
    +-------+------+          +----------+--------+
            |                            |
            | /dev/iommu                 | /dev/vfio/vfio
            | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
 Userspace  |                            |
 ===========+============================+================================
 Kernel     |  device fd                 |
            |                            | group/container fd
            | (BIND_IOMMUFD              | (SET_CONTAINER/SET_IOMMU)
            |  ATTACH_IOAS)              | device fd
            |                            |
            |        +-------V-----------V------------------+
    iommufd |        |                 vfio                 |
(map/unmap  |        +---------+-------------------+--------+
 ioas_copy) |                  |                   | map/unmap
            |                  |                   |
     +------V------+    +------V-----+     +-------V-------+
     | iommufd core|    |   device   |     |  vfio iommu   |
     +-------------+    +------------+     +---------------+
[Secure Context setup]
- iommufd BE: uses device fd and iommufd to setup secure context
  (bind_iommufd, attach_ioas)
- vfio legacy BE: uses group fd and container fd to setup secure context
  (set_container, set_iommu)
[Device access]
- iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: device fd is retrieved from group fd ioctl
[DMA Mapping flow]
- VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
- VFIO populates DMA map/unmap via the container BEs
  *) iommufd BE: uses iommufd
  *) vfio legacy BE: uses container fd
This series QOMifies the VFIOContainer object, which acts as a base class for a container. This base class is derived into the legacy VFIO container and the new iommufd based container. The base class implements generic code such as code related to the memory_listener and address space management, whereas the derived classes implement callbacks that depend on the kernel userspace interface being used.
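As a rough illustration of the split described above (not the actual code from the series), the container base class with backend callbacks could be sketched as follows; all type and callback names here are invented for the sketch:

/* Simplified sketch of the container base class / backend split described
 * above.  The names (VFIOContainerOps, etc.) are invented for illustration;
 * the series' real QOM types and callbacks will differ. */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;
typedef struct VFIOContainer VFIOContainer;

typedef struct VFIOContainerOps {
    /* secure context setup: bind_iommufd/attach_ioas vs.
     * set_container/set_iommu */
    int (*attach_device)(VFIOContainer *c, const char *sysfsdev);
    /* DMA map/unmap callbacks driven by the shared MemoryListener */
    int (*dma_map)(VFIOContainer *c, hwaddr iova, uint64_t size,
                   void *vaddr, bool readonly);
    int (*dma_unmap)(VFIOContainer *c, hwaddr iova, uint64_t size);
} VFIOContainerOps;

struct VFIOContainer {
    const VFIOContainerOps *ops;  /* legacy or iommufd backend */
    int fd;                       /* /dev/vfio/vfio or /dev/iommu */
    /* generic state shared by both backends: address space, listener, ... */
};

/* Generic code calls through the ops table and never needs to know which
 * kernel interface sits behind it. */
static inline int vfio_container_dma_map(VFIOContainer *c, hwaddr iova,
                                         uint64_t size, void *vaddr,
                                         bool readonly)
{
    return c->ops->dma_map(c, iova, size, vaddr, readonly);
}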
The selection of the backend is made on a device basis using the new iommufd option (on/off/auto). By default the iommufd backend is selected if supported by the host and by QEMU (iommufd KConfig). This option is currently available only for the vfio-pci device. For other types of devices, it does not yet exist and the legacy BE is chosen by default.
I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions.

However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.

The interface I was considering would be to add an iommufd object to QEMU, so we might have a:

-device iommufd[,fd=#][,id=foo]

For non-libvirt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest visible device?]

The vfio-pci device might then become:

-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access.

In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know deterministically which backend is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks,

Alex

On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
[Cc +libvirt folks]
On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote:
With the introduction of iommufd[1], the linux kernel provides a generic interface for userspace drivers to propagate their DMA mappings to the kernel for assigned devices. This series ports the VFIO devices onto the /dev/iommu uapi and lets it coexist with the legacy implementation. Other devices like vdpa, vfio mdev, etc. are not considered yet.
snip
The selection of the backend is made on a device basis using the new iommufd option (on/off/auto). By default the iommufd backend is selected if supported by the host and by QEMU (iommufd KConfig). This option is currently available only for the vfio-pci device. For other types of devices, it does not yet exist and the legacy BE is chosen by default.
I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions.
However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.
I assume access to '/dev/iommufd' gives the process somewhat elevated privileges, such that you don't want to unconditionally give QEMU access to this device ?
The interface I was considering would be to add an iommufd object to QEMU, so we might have a:
-device iommufd[,fd=#][,id=foo]
For non-libvirt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest visible device?]
Yes, -object would be the right answer for something that's purely a host side backend impl selector.
The vfio-pci device might then become:
-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access.
In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know deterministically which backend is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks,
Yep, I agree that libvirt needs to have more direct control over this. This is also even more important if there are notable feature differences in the 2 backends.

I wonder if anyone has considered an even more distinct impl, whereby we have a completely different device type on the backend, eg

-device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

If a vendor wants to fully remove the legacy impl, they can then use the Kconfig mechanism to disable the build of the legacy impl device, while keeping the iommu impl (or vice versa if the new iommu impl isn't considered reliable enough for them to support yet).

Libvirt would use

-object iommu,id=iommu0,fd=NNN
-device vfio-iommu-pci,fd=MMM,iommu=iommu0

Non-libvirt would use a simpler

-device vfio-iommu-pci,host=0000:03:22.1

with QEMU auto-creating a 'iommu' object in the background.

This would fit into libvirt's existing modelling better. We currently have a concept of a PCI assignment backend, which previously supported the legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl feels like a 3rd PCI assignment approach, and so fits with how we modelled it as a different device type in the past.

With regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, Apr 25, 2022 at 11:10:14AM +0100, Daniel P. Berrangé wrote:
However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.
I assume access to '/dev/iommufd' gives the process somewhat elevated privileges, such that you don't want to unconditionally give QEMU access to this device ?
It doesn't give much; at worst it allows userspace to allocate kernel memory and pin pages, which can already be done through all sorts of other interfaces QEMU already has access to. Jason

On Mon, 25 Apr 2022 11:10:14 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
[Cc +libvirt folks]
On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote:
With the introduction of iommufd[1], the linux kernel provides a generic interface for userspace drivers to propagate their DMA mappings to the kernel for assigned devices. This series ports the VFIO devices onto the /dev/iommu uapi and lets it coexist with the legacy implementation. Other devices like vdpa, vfio mdev, etc. are not considered yet.
snip
The selection of the backend is made on a device basis using the new iommufd option (on/off/auto). By default the iommufd backend is selected if supported by the host and by QEMU (iommufd KConfig). This option is currently available only for the vfio-pci device. For other types of devices, it does not yet exist and the legacy BE is chosen by default.
I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions.
However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.
I assume access to '/dev/iommufd' gives the process somewhat elevated privileges, such that you don't want to unconditionally give QEMU access to this device ?
It's not that dissimilar to /dev/vfio/vfio; it's an unprivileged interface which should have limited scope for abuse, but more so here the goal would be to de-privilege QEMU one step further so that it cannot open the device file itself.
The interface I was considering would be to add an iommufd object to QEMU, so we might have a:
-device iommufd[,fd=#][,id=foo]
For non-libvirt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest visible device?]
Yes, -object would be the right answer for something that's purely a host side backend impl selector.
The vfio-pci device might then become:
-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access.
In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know deterministically which backend is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks,
Yep, I agree that libvirt needs to have more direct control over this. This is also even more important if there are notable feature differences in the 2 backends.
I wonder if anyone has considered an even more distinct impl, whereby we have a completely different device type on the backend, eg
-device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
If a vendor wants to fully remove the legacy impl, they can then use the Kconfig mechanism to disable the build of the legacy impl device, while keeping the iommu impl (or vice versa if the new iommu impl isn't considered reliable enough for them to support yet).
Libvirt would use
-object iommu,id=iommu0,fd=NNN -device vfio-iommu-pci,fd=MMM,iommu=iommu0
Non-libvirt would use a simpler
-device vfio-iommu-pci,host=0000:03:22.1
with QEMU auto-creating a 'iommu' object in the background.
This would fit into libvirt's existing modelling better. We currently have a concept of a PCI assignment backend, which previously supported the legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl feels like a 3rd PCI assignment approach, and so fits with how we modelled it as a different device type in the past.
I don't think we want to conflate "iommu" and "iommufd", we're creating an object that interfaces into the iommufd uAPI, not an iommu itself. Likewise "vfio-iommu-pci" is just confusing, there was an iommu interface previously, it's just a different implementation now and as far as the VM interface to the device, it's identical. Note that a "vfio-iommufd-pci" device multiplies the matrix of every vfio device for a rather subtle implementation detail.

My expectation would be that libvirt uses:

-object iommufd,id=iommufd0,fd=NNN
-device vfio-pci,fd=MMM,iommufd=iommufd0

Whereas simple QEMU command line would be:

-object iommufd,id=iommufd0
-device vfio-pci,iommufd=iommufd0,host=0000:02:00.0

The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa.

If the old style were used:

-device vfio-pci,host=0000:02:00.0

Then QEMU would use vfio for the IOMMU backend.

If libvirt/userspace wants to query whether "legacy" vfio is still supported by the host kernel, I think it'd only need to look for whether the /dev/vfio/vfio container interface still exists.

If we need some means for QEMU to remove legacy support, I'd rather find a way to do it via probing device options. It's easy enough to see if iommufd support exists by looking for the presence of the iommufd option for the vfio-pci device and Kconfig within QEMU could be used regardless of whether we define a new device name. Thanks,

Alex

From: Alex Williamson <alex.williamson@redhat.com> Sent: Monday, April 25, 2022 10:38 PM
On Mon, 25 Apr 2022 11:10:14 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
[Cc +libvirt folks]
On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote:
With the introduction of iommufd[1], the linux kernel provides a generic interface for userspace drivers to propagate their DMA mappings to the kernel for assigned devices. This series ports the VFIO devices onto the /dev/iommu uapi and lets it coexist with the legacy implementation. Other devices like vdpa, vfio mdev, etc. are not considered yet.
snip
The selection of the backend is made on a device basis using the new iommufd option (on/off/auto). By default the iommufd backend is selected if supported by the host and by QEMU (iommufd KConfig). This option is currently available only for the vfio-pci device. For other types of devices, it does not yet exist and the legacy BE is chosen by default.
I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions.
However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.
I assume access to '/dev/iommufd' gives the process somewhat elevated privileges, such that you don't want to unconditionally give QEMU access to this device ?
It's not that dissimilar to /dev/vfio/vfio; it's an unprivileged interface which should have limited scope for abuse, but more so here the goal would be to de-privilege QEMU one step further so that it cannot open the device file itself.
The interface I was considering would be to add an iommufd object to QEMU, so we might have a:
-device iommufd[,fd=#][,id=foo]
For non-libvirt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest visible device?]
Yes, -object would be the right answer for something that's purely a host side backend impl selector.
The vfio-pci device might then become:
-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access.
In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know
Based on the current plan there is probably a transition window between the point where the first vfio device type (vfio-pci) gains iommufd support and the point where all vfio types support iommufd. Libvirt can figure out whether to use iommufd by checking the presence of /dev/vfio/devices/vfioX. But what would be the resource limit policy in Libvirt in such a transition window when both type1 and iommufd might be used? Or do we just expect Libvirt to support iommufd only after the transition window ends, to avoid handling such a mess?
deterministically which backend is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks,
Yep, I agree that libvirt needs to have more direct control over this. This is also even more important if there are notable feature differences in the 2 backends.
I wonder if anyone has considered an even more distinct impl, whereby we have a completely different device type on the backend, eg
-device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
If a vendor wants to fully remove the legacy impl, they can then use the Kconfig mechanism to disable the build of the legacy impl device, while keeping the iommu impl (or vice versa if the new iommu impl isn't considered reliable enough for them to support yet).
Libvirt would use
-object iommu,id=iommu0,fd=NNN -device vfio-iommu-pci,fd=MMM,iommu=iommu0
Non-libvirt would use a simpler
-device vfio-iommu-pci,host=0000:03:22.1
with QEMU auto-creating a 'iommu' object in the background.
This would fit into libvirt's existing modelling better. We currently have a concept of a PCI assignment backend, which previously supported the legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl feels like a 3rd PCI assignment approach, and so fits with how we modelled it as a different device type in the past.
I don't think we want to conflate "iommu" and "iommufd", we're creating an object that interfaces into the iommufd uAPI, not an iommu itself. Likewise "vfio-iommu-pci" is just confusing, there was an iommu interface previously, it's just a different implementation now and as far as the VM interface to the device, it's identical. Note that a "vfio-iommufd-pci" device multiplies the matrix of every vfio device for a rather subtle implementation detail.
My expectation would be that libvirt uses:
-object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0
Whereas simple QEMU command line would be:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa.
Out of curiosity - in concept one iommufd is sufficient to support all IOAS requirements across subsystems, while having multiple iommufds instead loses the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD, which requires VM-wide PASID virtualization and thus further needs to share that information across iommufds. Not unsolvable, but there is really no gain from adding such complexity. So I'm curious whether Qemu provides a way to restrict a certain object type to a single instance, to discourage such multi-iommufd attempts?
If the old style were used:
-device vfio-pci,host=0000:02:00.0
Then QEMU would use vfio for the IOMMU backend.
If libvirt/userspace wants to query whether "legacy" vfio is still supported by the host kernel, I think it'd only need to look for whether the /dev/vfio/vfio container interface still exists.
If we need some means for QEMU to remove legacy support, I'd rather find a way to do it via probing device options. It's easy enough to see if iommufd support exists by looking for the presence of the iommufd option for the vfio-pci device and Kconfig within QEMU could be used regardless of whether we define a new device name. Thanks,
Alex

On Tue, Apr 26, 2022 at 08:37:41AM +0000, Tian, Kevin wrote:
Based on the current plan there is probably a transition window between the point where the first vfio device type (vfio-pci) gains iommufd support and the point where all vfio types support iommufd.
I am still hoping to do all in one shot, let's see :) Jason

On Tue, 26 Apr 2022 08:37:41 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson <alex.williamson@redhat.com> Sent: Monday, April 25, 2022 10:38 PM
On Mon, 25 Apr 2022 11:10:14 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
[Cc +libvirt folks]
On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote:
With the introduction of iommufd[1], the linux kernel provides a generic interface for userspace drivers to propagate their DMA mappings to the kernel for assigned devices. This series ports the VFIO devices onto the /dev/iommu uapi and lets it coexist with the legacy implementation. Other devices like vdpa, vfio mdev, etc. are not considered yet.
snip
The selection of the backend is made on a device basis using the new iommufd option (on/off/auto). By default the iommufd backend is selected if supported by the host and by QEMU (iommufd KConfig). This option is currently available only for the vfio-pci device. For other types of devices, it does not yet exist and the legacy BE is chosen by default.
I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions.
However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.
I assume access to '/dev/iommufd' gives the process somewhat elevated privileges, such that you don't want to unconditionally give QEMU access to this device ?
It's not that dissimilar to /dev/vfio/vfio; it's an unprivileged interface which should have limited scope for abuse, but more so here the goal would be to de-privilege QEMU one step further so that it cannot open the device file itself.
The interface I was considering would be to add an iommufd object to QEMU, so we might have a:
-device iommufd[,fd=#][,id=foo]
For non-libvirt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest visible device?]
Yes, -object would be the right answer for something that's purely a host side backend impl selector.
The vfio-pci device might then become:
-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access.
In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know
Based on the current plan there is probably a transition window between the point where the first vfio device type (vfio-pci) gains iommufd support and the point where all vfio types support iommufd. Libvirt can figure out whether to use iommufd by checking the presence of /dev/vfio/devices/vfioX. But what would be the resource limit policy in Libvirt in such a transition window when both type1 and iommufd might be used? Or do we just expect Libvirt to support iommufd only after the transition window ends, to avoid handling such a mess?
Good point regarding libvirt testing for the vfio device files for use with iommufd, so libvirt would test if /dev/iommufd exists and if the device they want to assign maps to a /dev/vfio/devices/vfioX file. This was essentially implicit in the fd=# option to the vfio-pci device. In mixed combinations, I'd expect libvirt to continue to add the full VM memory to the locked memory limit for each non-iommufd device added.
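A minimal sketch of that probe follows; the /dev/iommufd path comes from this thread, while the sysfs "vfio-dev" link used to map a PCI device to its vfioX node is an assumption for illustration and depends on the kernel series:

/* Rough sketch of the probe described above: does the host expose the
 * iommufd interface, and does the device we want to assign have a
 * per-device vfio cdev?  The sysfs layout used here is an assumption. */
#include <dirent.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static bool host_has_iommufd(void)
{
    /* the thread refers to the char dev as both /dev/iommu and /dev/iommufd */
    return access("/dev/iommufd", F_OK) == 0 || access("/dev/iommu", F_OK) == 0;
}

/* Assumed mapping: /sys/bus/pci/devices/<bdf>/vfio-dev/vfioX */
static bool device_has_vfio_cdev(const char *bdf, char *node, size_t len)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vfio-dev", bdf);

    DIR *d = opendir(path);
    if (!d) {
        return false;                    /* legacy group interface only */
    }
    struct dirent *e;
    bool found = false;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, "vfio", 4) == 0) {
            snprintf(node, len, "/dev/vfio/devices/%s", e->d_name);
            found = true;
            break;
        }
    }
    closedir(d);
    return found;
}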
deterministically which backend is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks,
Yep, I agree that libvirt needs to have more direct control over this. This is also even more important if there are notable feature differences in the 2 backends.
I wonder if anyone has considered an even more distinct impl, whereby we have a completely different device type on the backend, eg
-device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
If a vendor wants to fully remove the legacy impl, they can then use the Kconfig mechanism to disable the build of the legacy impl device, while keeping the iommu impl (or vice versa if the new iommu impl isn't considered reliable enough for them to support yet).
Libvirt would use
-object iommu,id=iommu0,fd=NNN -device vfio-iommu-pci,fd=MMM,iommu=iommu0
Non-libvirt would use a simpler
-device vfio-iommu-pci,host=0000:03:22.1
with QEMU auto-creating a 'iommu' object in the background.
This would fit into libvirt's existing modelling better. We currently have a concept of a PCI assignment backend, which previously supported the legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl feels like a 3rd PCI assignment approach, and so fits with how we modelled it as a different device type in the past.
I don't think we want to conflate "iommu" and "iommufd", we're creating an object that interfaces into the iommufd uAPI, not an iommu itself. Likewise "vfio-iommu-pci" is just confusing, there was an iommu interface previously, it's just a different implementation now and as far as the VM interface to the device, it's identical. Note that a "vfio-iommufd-pci" device multiplies the matrix of every vfio device for a rather subtle implementation detail.
My expectation would be that libvirt uses:
-object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0
Whereas simple QEMU command line would be:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa.
Out of curiosity - in concept one iommufd is sufficient to support all IOAS requirements across subsystems, while having multiple iommufds instead loses the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD, which requires VM-wide PASID virtualization and thus further needs to share that information across iommufds. Not unsolvable, but there is really no gain from adding such complexity. So I'm curious whether Qemu provides a way to restrict a certain object type to a single instance, to discourage such multi-iommufd attempts?
I don't see any reason for QEMU to restrict iommufd objects. The QEMU philosophy seems to be to let users create whatever configuration they want. For libvirt though, the assumption would be that a single iommufd object can be used across subsystems, so libvirt would never automatically create multiple objects. We also need to be able to advise libvirt as to how each iommufd object or user of that object factors into the VM locked memory requirement. When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt to set the locked memory limit to the size of VM RAM per iommufd, regardless of the number of devices using a given iommufd. However, I don't know if all users of iommufd will be exclusively mapping VM RAM. Combinations of devices where some map VM RAM and others map QEMU buffer space could still require some incremental increase per device (I'm not sure if vfio-nvme is such a device). It seems like heuristics will still be involved even after iommufd solves the per-device vfio-pci locked memory limit issue. Thanks, Alex
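To make the accounting implication concrete, a tiny sketch of the kind of heuristic libvirt might apply is below; the per-device buffer allowance is an illustrative assumption, not a number from this thread:

/* Sketch of the locked-memory heuristic discussed above.  The per-device
 * buffer allowance is an illustrative assumption. */
#include <stdint.h>

struct vm_assignment {
    uint64_t vm_ram_bytes;            /* guest RAM that gets DMA mapped     */
    unsigned num_iommufd_objects;     /* usually 0 or 1                     */
    unsigned num_legacy_vfio_devices; /* devices still on type1 containers  */
    unsigned num_buffer_devices;      /* devices mapping QEMU buffer space  */
    uint64_t per_device_buffer_bytes; /* rough allowance per such device    */
};

static uint64_t memlock_limit_estimate(const struct vm_assignment *a)
{
    uint64_t limit = 0;

    /* iommufd: VM RAM accounted once per iommufd, no matter how many
     * devices share it. */
    limit += a->vm_ram_bytes * a->num_iommufd_objects;

    /* legacy type1: duplicate accounting, so VM RAM once per device. */
    limit += a->vm_ram_bytes * a->num_legacy_vfio_devices;

    /* devices that also map QEMU buffer space may need an incremental
     * allowance each. */
    limit += a->per_device_buffer_bytes * a->num_buffer_devices;

    return limit;
}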

On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
We also need to be able to advise libvirt as to how each iommufd object or user of that object factors into the VM locked memory requirement. When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt to set the locked memory limit to the size of VM RAM per iommufd, regardless of the number of devices using a given iommufd. However, I don't know if all users of iommufd will be exclusively mapping VM RAM. Combinations of devices where some map VM RAM and others map QEMU buffer space could still require some incremental increase per device (I'm not sure if vfio-nvme is such a device). It seems like heuristics will still be involved even after iommufd solves the per-device vfio-pci locked memory limit issue. Thanks,
If the model is to pass the FD, how about we put a limit on the FD itself instead of abusing the locked memory limit? We could have a no-way-out ioctl that directly limits the # of PFNs covered by iopt_pages inside an iommufd. Jason

On Tue, 26 Apr 2022 13:42:17 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
We also need to be able to advise libvirt as to how each iommufd object or user of that object factors into the VM locked memory requirement. When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt to set the locked memory limit to the size of VM RAM per iommufd, regardless of the number of devices using a given iommufd. However, I don't know if all users of iommufd will be exclusively mapping VM RAM. Combinations of devices where some map VM RAM and others map QEMU buffer space could still require some incremental increase per device (I'm not sure if vfio-nvme is such a device). It seems like heuristics will still be involved even after iommufd solves the per-device vfio-pci locked memory limit issue. Thanks,
If the model is to pass the FD, how about we put a limit on the FD itself instead of abusing the locked memory limit?
We could have a no-way-out ioctl that directly limits the # of PFNs covered by iopt_pages inside an iommufd.
FD passing would likely only be the standard for libvirt invoked VMs. The QEMU vfio-pci device would still parse a host= or sysfsdev= option when invoked by mortals, and would use the legacy vfio group interface or the new vfio device interface based on whether an iommufd is specified. Does that rule out your suggestion? I don't know, please reveal more about the mechanics of putting a limit on the FD itself and this no-way-out ioctl. The latter name suggests to me that I should also note that we need to support memory hotplug with these devices. Thanks, Alex

On Tue, Apr 26, 2022 at 01:24:35PM -0600, Alex Williamson wrote:
On Tue, 26 Apr 2022 13:42:17 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
We also need to be able to advise libvirt as to how each iommufd object or user of that object factors into the VM locked memory requirement. When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt to set the locked memory limit to the size of VM RAM per iommufd, regardless of the number of devices using a given iommufd. However, I don't know if all users of iommufd will be exclusively mapping VM RAM. Combinations of devices where some map VM RAM and others map QEMU buffer space could still require some incremental increase per device (I'm not sure if vfio-nvme is such a device). It seems like heuristics will still be involved even after iommufd solves the per-device vfio-pci locked memory limit issue. Thanks,
If the model is to pass the FD, how about we put a limit on the FD itself instead of abusing the locked memory limit?
We could have a no-way-out ioctl that directly limits the # of PFNs covered by iopt_pages inside an iommufd.
FD passing would likely only be the standard for libvirt invoked VMs. The QEMU vfio-pci device would still parse a host= or sysfsdev= option when invoked by mortals, and would use the legacy vfio group interface or the new vfio device interface based on whether an iommufd is specified.
Yes, but perhaps we don't need resource limits in the mortals case..
Does that rule out your suggestion? I don't know, please reveal more about the mechanics of putting a limit on the FD itself and this no-way-out ioctl. The latter name suggests to me that I should also note that we need to support memory hotplug with these devices. Thanks,
So libvirt uses CAP_SYS_RESOURCE and prlimit to adjust things in realtime today? It could still work; instead of being no-way-out, iommufd would have to check for CAP_SYS_RESOURCE to make the limit higher.

It is a pretty simple idea, we just attach a resource limit to the FD and every PFN that gets mapped into the iommufd counts against that limit, regardless of whether it is pinned or not. An ioctl on the FD would set the limit, defaulting to unlimited.

To me this has the appeal that what is being resource controlled is strictly defined - address space mapped into an iommufd - which has a bunch of nice additional consequences like partially bounding the amount of kernel memory an iommufd can consume and so forth.

It doesn't interact with io_uring or RDMA however. Though we could certainly consider allowing RDMA to consume an iommufd to access pinned pages much like a vfio-mdev would - I'm not sure what is ideal for the qemu usage of RDMA for migration.. Jason
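For context on the status quo referenced above, this is roughly how a manager process holding CAP_SYS_RESOURCE can raise a running QEMU's locked-memory limit with prlimit(2) today (a sketch, not libvirt's actual code); the proposed per-FD PFN-limit ioctl does not exist and is not shown:

/* Sketch of raising a running process's RLIMIT_MEMLOCK via prlimit(2), as a
 * privileged (CAP_SYS_RESOURCE) manager such as libvirt can do today.  Not
 * libvirt's actual implementation. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>

static int raise_memlock_limit(pid_t qemu_pid, rlim_t bytes)
{
    struct rlimit new_limit = {
        .rlim_cur = bytes,
        .rlim_max = bytes,
    };

    /* Adjusts the target process's limit in place; raising another
     * process's hard limit requires CAP_SYS_RESOURCE. */
    if (prlimit(qemu_pid, RLIMIT_MEMLOCK, &new_limit, NULL) < 0) {
        perror("prlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    return 0;
}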

From: Alex Williamson <alex.williamson@redhat.com> Sent: Wednesday, April 27, 2022 12:22 AM
My expectation would be that libvirt uses:
-object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0
Whereas simple QEMU command line would be:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa.
Out of curiosity - in concept one iommufd is sufficient to support all IOAS requirements across subsystems, while having multiple iommufds instead loses the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD, which requires VM-wide PASID virtualization and thus further needs to share that information across iommufds. Not unsolvable, but there is really no gain from adding such complexity. So I'm curious whether Qemu provides a way to restrict a certain object type to a single instance, to discourage such multi-iommufd attempts?
I don't see any reason for QEMU to restrict iommufd objects. The QEMU philosophy seems to be to let users create whatever configuration they want. For libvirt though, the assumption would be that a single iommufd object can be used across subsystems, so libvirt would never automatically create multiple objects.
I like the flexibility that the object approach gives in your proposal. But with said complexity in mind (with no foreseen benefit), I wonder whether an alternative approach which treats iommufd as a global property instead of an object is acceptable in Qemu, i.e.:

-iommufd on/off
-device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]

All devices with iommufd specified then implicitly share a single iommufd object within Qemu.

This still allows vfio devices to be specified via fd but just requires Libvirt to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be considered, or just not a typical way in Qemu philosophy, e.g. any object associated with a device must be explicitly specified?

Thanks
Kevin

On Thu, 28 Apr 2022 03:21:45 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson <alex.williamson@redhat.com> Sent: Wednesday, April 27, 2022 12:22 AM
My expectation would be that libvirt uses:
-object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0
Whereas simple QEMU command line would be:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa.
Out of curiosity - in concept one iommufd is sufficient to support all IOAS requirements across subsystems, while having multiple iommufds instead loses the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD, which requires VM-wide PASID virtualization and thus further needs to share that information across iommufds. Not unsolvable, but there is really no gain from adding such complexity. So I'm curious whether Qemu provides a way to restrict a certain object type to a single instance, to discourage such multi-iommufd attempts?
I don't see any reason for QEMU to restrict iommufd objects. The QEMU philosophy seems to be to let users create whatever configuration they want. For libvirt though, the assumption would be that a single iommufd object can be used across subsystems, so libvirt would never automatically create multiple objects.
I like the flexibility that the object approach gives in your proposal. But with said complexity in mind (with no foreseen benefit), I wonder
What's the actual complexity? Front-end/backend splits are very common in QEMU. We're making the object connection via name, why is it significantly more complicated to allow multiple iommufd objects? On the contrary, it seems to me that we'd need to go out of our way to add code to block multiple iommufd objects.
whether an alternative approach which treats iommufd as a global property instead of an object is acceptable in Qemu, i.e.:
-iommufd on/off -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
All devices with iommufd specified then implicitly share a single iommufd object within Qemu.
QEMU requires key-value pairs AFAIK, so the above doesn't work, then we're just back to the iommufd=on/off.
This still allows vfio devices to be specified via fd but just requires Libvirt to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be considered or just not a typical way in Qemu philosophy e.g. any object associated with a device must be explicitly specified?
Avoiding QEMU opening files was a significant focus of my alternate proposal. Also note that we must be able to support hotplug, so we need to be able to dynamically add and remove the iommufd object, I don't see that a global property allows for that. Implicit associations of devices to shared resources doesn't seem particularly desirable to me. Thanks, Alex

On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote:
On Thu, 28 Apr 2022 03:21:45 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson <alex.williamson@redhat.com> Sent: Wednesday, April 27, 2022 12:22 AM
My expectation would be that libvirt uses:
-object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0
Whereas simple QEMU command line would be:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa.
Out of curiosity - in concept one iommufd is sufficient to support all IOAS requirements across subsystems, while having multiple iommufds instead loses the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD, which requires VM-wide PASID virtualization and thus further needs to share that information across iommufds. Not unsolvable, but there is really no gain from adding such complexity. So I'm curious whether Qemu provides a way to restrict a certain object type to a single instance, to discourage such multi-iommufd attempts?
I don't see any reason for QEMU to restrict iommufd objects. The QEMU philosophy seems to be to let users create whatever configuration they want. For libvirt though, the assumption would be that a single iommufd object can be used across subsystems, so libvirt would never automatically create multiple objects.
I like the flexibility that the object approach gives in your proposal. But with said complexity in mind (with no foreseen benefit), I wonder
What's the actual complexity? Front-end/backend splits are very common in QEMU. We're making the object connection via name, why is it significantly more complicated to allow multiple iommufd objects? On the contrary, it seems to me that we'd need to go out of our way to add code to block multiple iommufd objects.
whether an alternative approach which treats iommufd as a global property instead of an object is acceptable in Qemu, i.e.:
-iommufd on/off -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
All devices with iommufd specified then implicitly share a single iommufd object within Qemu.
QEMU requires key-value pairs AFAIK, so the above doesn't work, then we're just back to the iommufd=on/off.
This still allows vfio devices to be specified via fd but just requires Libvirt to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be considered or just not a typical way in Qemu philosophy e.g. any object associated with a device must be explicitly specified?
Avoiding QEMU opening files was a significant focus of my alternate proposal. Also note that we must be able to support hotplug, so we need to be able to dynamically add and remove the iommufd object, I don't see that a global property allows for that. Implicit associations of devices to shared resources doesn't seem particularly desirable to me. Thanks,
Adding new global properties/options is rather an anti-pattern for QEMU these days. Using -object is the right approach. If you only want to allow for one of them, just document this requirement. We've got other objects which are singletons, like all the confidential guest classes for each arch.

With regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
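As an illustration of the "enforce it vs. just document it" choice, a backend object could trivially reject a second instance at creation time along these lines; the names below are invented for the sketch, and QEMU's real user-creatable object machinery is more involved:

/* Minimal sketch of singleton enforcement for a host-side backend object.
 * All names are invented for illustration; QEMU's actual QOM user-creatable
 * object lifecycle differs. */
#include <stdio.h>
#include <stdlib.h>

typedef struct IommufdBackend {
    int fd;                     /* /dev/iommufd, passed in or opened lazily */
} IommufdBackend;

static IommufdBackend *single_instance;

static IommufdBackend *iommufd_backend_create(void)
{
    if (single_instance) {
        /* Either fail here (hard singleton), or drop this check and simply
         * document that management apps are expected to create one. */
        fprintf(stderr, "only one iommufd backend object is supported\n");
        return NULL;
    }
    single_instance = calloc(1, sizeof(*single_instance));
    if (single_instance) {
        single_instance->fd = -1;   /* not opened yet */
    }
    return single_instance;
}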

From: Daniel P. Berrangé <berrange@redhat.com> Sent: Friday, April 29, 2022 12:20 AM
On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote:
On Thu, 28 Apr 2022 03:21:45 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson <alex.williamson@redhat.com> Sent: Wednesday, April 27, 2022 12:22 AM
My expectation would be that libvirt uses:
-object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0
Whereas simple QEMU command line would be:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa.
Out of curiosity - in concept one iommufd is sufficient to support all IOAS requirements across subsystems, while having multiple iommufds instead loses the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD, which requires VM-wide PASID virtualization and thus further needs to share that information across iommufds. Not unsolvable, but there is really no gain from adding such complexity. So I'm curious whether Qemu provides a way to restrict a certain object type to a single instance, to discourage such multi-iommufd attempts?
I don't see any reason for QEMU to restrict iommufd objects. The QEMU philosophy seems to be to let users create whatever configuration they want. For libvirt though, the assumption would be that a single iommufd object can be used across subsystems, so libvirt would never automatically create multiple objects.
I like the flexibility that the object approach gives in your proposal. But with said complexity in mind (with no foreseen benefit), I wonder
What's the actual complexity? Front-end/backend splits are very common in QEMU. We're making the object connection via name, why is it significantly more complicated to allow multiple iommufd objects? On the contrary, it seems to me that we'd need to go out of our way to add code to block multiple iommufd objects.
Probably it's just a hypothetical concern when I thought about the need of managing certain global information (e.g. PASID virtualization) across iommufds down the road. With your and Daniel's replies I think we'll try to follow the common practice in Qemu first, given there are more positive reasons to do so than the hypothetical concern itself.
whether an alternative approach which treats iommufd as a global property instead of an object is acceptable in Qemu, i.e.:
-iommufd on/off -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
All devices with iommufd specified then implicitly share a single iommufd object within Qemu.
QEMU requires key-value pairs AFAIK, so the above doesn't work, then we're just back to the iommufd=on/off.
This still allows vfio devices to be specified via fd but just requires Libvirt to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be considered or just not a typical way in Qemu philosophy e.g. any object associated with a device must be explicitly specified?
Avoiding QEMU opening files was a significant focus of my alternate proposal. Also note that we must be able to support hotplug, so we need to be able to dynamically add and remove the iommufd object, I don't see that a global property allows for that. Implicit associations of devices to shared resources doesn't seem particularly desirable to me. Thanks,
Adding new global properties/options is rather an anti-pattern for QEMU these days. Using -object is the right approach. If you only want to allow for one of them, just document this requirement. We've got other objects which are singletons like all the confidential guest classes for each arch.
Good to know about such a last resort. As said, we'll try to avoid this restriction and follow Alex's proposal unless unexpectedly unreasonable complexities arise later. Thanks, Kevin

Hi Alex,

On 4/23/22 12:09 AM, Alex Williamson wrote:
[Cc +libvirt folks]
On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote:
With the introduction of iommufd[1], the linux kernel provides a generic interface for userspace drivers to propagate their DMA mappings to the kernel for assigned devices. This series ports the VFIO devices onto the /dev/iommu uapi and lets it coexist with the legacy implementation. Other devices like vdpa, vfio mdev, etc. are not considered yet.
For vfio devices, the new interface is tied to the device fd and iommufd, as the iommufd solution is device-centric. This is different from legacy vfio, which is group-centric. To support both interfaces in QEMU, this series introduces the iommu backend concept in the form of different container classes. The existing vfio container is named the legacy container (equivalent to the legacy iommu backend in this series), while the new iommufd based container is named the iommufd container (also referred to as the iommufd backend in this series). The two backend types have their own way to set up a secure context and a DMA management interface. The diagram below shows how it looks with both BEs.
                     VFIO                           AddressSpace/Memory
    +-------+  +----------+  +-----+  +-----+
    |  pci  |  | platform |  |  ap |  | ccw |
    +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
        |           |           |        |        |   AddressSpace       |
        |           |           |        |        +------------+---------+
    +---V-----------V-----------V--------V----+               /
    |             VFIOAddressSpace            | <------------+
    |                     |                   |  MemoryListener
    |            VFIOContainer list           |
    +-------+----------------------------+----+
            |                            |
            |                            |
    +-------V------+          +----------V--------+
    |   iommufd    |          |    vfio legacy    |
    |  container   |          |     container     |
    +-------+------+          +----------+--------+
            |                            |
            | /dev/iommu                 | /dev/vfio/vfio
            | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
 Userspace  |                            |
 ===========+============================+================================
 Kernel     |  device fd                 |
            |                            | group/container fd
            | (BIND_IOMMUFD              | (SET_CONTAINER/SET_IOMMU)
            |  ATTACH_IOAS)              | device fd
            |                            |
            |        +-------V-----------V------------------+
    iommufd |        |                 vfio                 |
(map/unmap  |        +---------+-------------------+--------+
 ioas_copy) |                  |                   | map/unmap
            |                  |                   |
     +------V------+    +------V-----+     +-------V-------+
     | iommufd core|    |   device   |     |  vfio iommu   |
     +-------------+    +------------+     +---------------+
[Secure Context setup]
- iommufd BE: uses device fd and iommufd to setup secure context
  (bind_iommufd, attach_ioas)
- vfio legacy BE: uses group fd and container fd to setup secure context
  (set_container, set_iommu)
[Device access]
- iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: device fd is retrieved from group fd ioctl
[DMA Mapping flow]
- VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
- VFIO populates DMA map/unmap via the container BEs
  *) iommufd BE: uses iommufd
  *) vfio legacy BE: uses container fd
This series QOMifies the VFIOContainer object, which acts as a base class for a container. This base class is derived into the legacy VFIO container and the new iommufd based container. The base class implements generic code such as code related to the memory_listener and address space management, whereas the derived classes implement callbacks that depend on the kernel userspace interface being used.
The selection of the backend is made on a device basis using the new iommufd option (on/off/auto). By default the iommufd backend is selected if supported by the host and by QEMU (iommufd KConfig). This option is currently available only for the vfio-pci device. For other types of devices, it does not yet exist and the legacy BE is chosen by default.

I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions.
However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.
The interface I was considering would be to add an iommufd object to QEMU, so we might have a:
-device iommufd[,fd=#][,id=foo]
For non-libvirt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest-visible device?]
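Concretely, and treating the exact spelling as provisional per the NB above (the id and fd values are purely illustrative), the object might be created as

-object iommufd,id=iommufd0

letting QEMU open /dev/iommufd itself, or as

-object iommufd,id=iommufd0,fd=23

when the management layer has already opened the iommufd and passed descriptor 23 in; any number of assigned devices could then share iommufd0 by referencing its id.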
The vfio-pci device might then become:
-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access. What is the use case you foresee for the "fd=#" option?
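Putting the two pieces together, a sketch of the fully fd-passed variant being discussed here (option names as proposed above, descriptor numbers purely illustrative) could be

-object iommufd,id=iommufd0,fd=23 -device vfio-pci,fd=24,iommufd=iommufd0

where fd 23 is the already-open iommufd and fd 24 is the open vfio device file, so QEMU itself never opens anything under /dev or navigates sysfs for the assigned device.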
In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know deterministically which backend is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks,
Alex
I like your proposal (based on the -object iommufd). The only thing that may be missing, I think, is for a QEMU end-user who does not care which iommu backend is used and just wishes to use the most recent one available; for such a user this adds some extra complexity. But this is not the most important use case ;)
Thanks
Eric

On Mon, 25 Apr 2022 22:23:05 +0200 Eric Auger <eric.auger@redhat.com> wrote:
Hi Alex,
On 4/23/22 12:09 AM, Alex Williamson wrote:
[Cc +libvirt folks]
With the fd passing option, an iommufd object would be required and necessarily use device level access. What is the use case you foresee for the "fd=#" option?
On the vfio-pci device this was intended to be the actual vfio device file descriptor. Once we have a file per device, QEMU doesn't really have any need to navigate through sysfs to determine which fd to use other than for user convenience on the command line. For libvirt usage, I assume QEMU could accept the device fd, without ever really knowing anything about the host address or sysfs path of the device.
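As a rough launcher-side sketch of that flow (everything here is an assumption for illustration: the /dev/iommu and /dev/vfio/devices/vfio0 paths follow the cover letter, and the -object/-device option names follow the proposal above; none of it is a final interface), the management process might simply open the files and exec QEMU with the inherited descriptors:

/* Hypothetical launcher-side sketch: open the iommufd and vfio device files
 * and hand the inherited fds to QEMU, so QEMU never looks at sysfs. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int iommufd = open("/dev/iommu", O_RDWR);             /* iommufd char device */
    int devfd = open("/dev/vfio/devices/vfio0", O_RDWR);  /* per-device vfio file */

    if (iommufd < 0 || devfd < 0) {
        perror("open");
        return 1;
    }

    /* fds opened without O_CLOEXEC stay open across exec, so QEMU inherits them */
    char obj[64], dev[64];
    snprintf(obj, sizeof(obj), "iommufd,id=iommufd0,fd=%d", iommufd);
    snprintf(dev, sizeof(dev), "vfio-pci,iommufd=iommufd0,fd=%d", devfd);

    execlp("qemu-system-x86_64", "qemu-system-x86_64",
           "-object", obj, "-device", dev, (char *)NULL);

    perror("execlp");  /* only reached if exec fails */
    return 1;
}

The point being that QEMU only ever sees two descriptor numbers; the host PCI address and sysfs path stay entirely on the launcher's side.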
I like your proposal (based on the -object iommufd). The only thing that may be missing, I think, is for a QEMU end-user who does not care which iommu backend is used and just wishes to use the most recent one available; for such a user this adds some extra complexity. But this is not the most important use case ;)
Yeah, I can sympathize with that, but isn't that also why we're pursuing a vfio compatibility interface at the kernel level? Eventually, once the native vfio IOMMU backends go away, the vfio "container" device file will be provided by iommufd and that transition to the new interface can be both seamless to the user and apparent to tools like libvirt. An end-user with a fixed command line should continue to work and will eventually get iommufd via compatibility, but taking care of an end-user that "does not care" and "wishes to use the most recent" is a non-goal for me. That would be more troublesome for tools and use cases that we do care about imo. Thanks,
Alex
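To illustrate the fixed-command-line point with a concrete (made-up) example: a guest started today with

-device vfio-pci,host=0000:02:00.0

and no iommufd object keeps using the legacy /dev/vfio/vfio container path, and once that container device file is provided by iommufd through the compatibility interface the same command line keeps working unchanged; only an explicit iommufd object on the command line changes which backend tools like libvirt have to account for.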
participants (5):
- Alex Williamson
- Daniel P. Berrangé
- Eric Auger
- Jason Gunthorpe
- Tian, Kevin