
On Tue, 26 Apr 2022 08:37:41 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson <alex.williamson@redhat.com> Sent: Monday, April 25, 2022 10:38 PM
On Mon, 25 Apr 2022 11:10:14 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
[Cc +libvirt folks]
On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote:
With the introduction of iommufd[1], the linux kernel provides a generic interface for userspace drivers to propagate their DMA mappings to the kernel for assigned devices. This series does the porting of the VFIO devices onto the /dev/iommu uapi and lets it coexist with the legacy implementation. Other devices like vdpa, vfio mdev, etc. are not considered yet.
[snip]
The selection of the backend is made on a device basis using the new iommufd option (on/off/auto). By default the iommufd backend is selected if supported by the host and by QEMU (iommufd KConfig). This option is currently available only for the vfio-pci device. For other types of devices, it does not yet exist and the legacy BE is chosen by default.
I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions.
However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this.
I assume access to '/dev/iommufd' gives the process somewhat elevated privileges, such that you don't want to unconditionally give QEMU access to this device ?
It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged interface which should have limited scope for abuse, but more so here the goal would be to de-privilege QEMU that one step further, so that it cannot open the device file itself.
The interface I was considering would be to add an iommufd object to QEMU, so we might have a:
-device iommufd[,fd=#][,id=foo]
For non-libvirt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest visible device?]
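For illustration, a single such object could then back several assigned devices at once (a sketch only, using the -object spelling from the NB and the vfio-pci properties proposed just below; the BDFs are made up):

  -object iommufd,id=iommufd0 \
  -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0 \
  -device vfio-pci,host=0000:03:00.0,iommufd=iommufd0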
Yes, -object would be the right answer for something that's purely a host side backend impl selector.
The vfio-pci device might then become:
-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access.
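As a rough sketch of the fd-passing flow (illustrative only: the fd numbers, the vfio0 node and the qemu binary name are made up, and the option spellings follow the proposal above), a management process might pre-open the files and let QEMU inherit the descriptors:

  # open the chardevs in the management process; the fds stay open across exec
  exec 3<>/dev/iommufd
  exec 4<>/dev/vfio/devices/vfio0
  qemu-system-x86_64 ... \
      -object iommufd,id=iommufd0,fd=3 \
      -device vfio-pci,fd=4,iommufd=iommufd0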
In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know
Based on the current plan there is probably a transition window between the point where the first vfio device type (vfio-pci) gains iommufd support and the point where all vfio types support iommufd. Libvirt can figure out whether to use iommufd by checking the presence of /dev/vfio/devices/vfioX. But what would be the resource limit policy in Libvirt in such a transition window, when both type1 and iommufd might be used? Or do we just expect Libvirt to support iommufd only after the transition window ends, to avoid handling such a mess?
Good point regarding libvirt testing for the vfio device files for use with iommufd, so libvirt would test if /dev/iommufd exists and if the device it wants to assign maps to a /dev/vfio/devices/vfioX file. This was essentially implicit in the fd=# option to the vfio-pci device. In mixed combinations, I'd expect libvirt to continue to add the full VM memory to the locked memory limit for each non-iommufd device added.
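Roughly, that probe could look like the following sketch (not libvirt code; only the /dev/iommufd and /dev/vfio/devices/vfioX paths come from the discussion, and mapping a specific PCI device to its vfioX node is glossed over):

  # is the iommufd interface present at all?
  [ -c /dev/iommufd ] || echo "no /dev/iommufd, fall back to type1"
  # are per-device vfio cdevs exposed?
  ls /dev/vfio/devices/ 2>/dev/null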
deterministically which backend is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks,
Yep, I agree that libvirt needs to have more direct control over this. This is also even more important if there are notable feature differences in the 2 backends.
I wonder if anyone has considered an even more distinct impl, whereby we have a completely different device type on the backend, eg
-device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
If a vendor wants to fully remove the legacy impl, they can then use the Kconfig mechanism to disable the build of the legacy impl device, while keeping the iommu impl (or vice-versa if the new iommu impl isn't considered reliable enough for them to support yet).
Libvirt would use
-object iommu,id=iommu0,fd=NNN -device vfio-iommu-pci,fd=MMM,iommu=iommu0
Non-libvirt would use a simpler
-device vfio-iommu-pci,host=0000:03:22.1
with QEMU auto-creating a 'iommu' object in the background.
This would fit into libvirt's existing modelling better. We currently have a concept of a PCI assignment backend, which previously supported the legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl feels like a 3rd PCI assignment approach, and so fits with how we modelled it as a different device type in the past.
I don't think we want to conflate "iommu" and "iommufd", we're creating an object that interfaces into the iommufd uAPI, not an iommu itself. Likewise "vfio-iommu-pci" is just confusing, there was an iommu interface previously, it's just a different implementation now and as far as the VM interface to the device, it's identical. Note that a "vfio-iommufd-pci" device multiplies the matrix of every vfio device for a rather subtle implementation detail.
My expectation would be that libvirt uses:
-object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0
Whereas simple QEMU command line would be:
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is somewhat problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared not only across various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, e.g. vdpa.
Out of curiosity - in concept one iommufd is sufficient to support all IOAS requirements across subsystems, while having multiple iommufd's instead loses the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD, which requires VM-wide PASID virtualization and thus further needs to share that information across iommufd's. Not unsolvable, but really no gain from adding such complexity. So I'm curious whether QEMU provides a way to restrict a certain object type to a single instance, to discourage such multi-iommufd attempts?
I don't see any reason for QEMU to restrict iommufd objects. The QEMU philosophy seems to be to let users create whatever configuration they want. For libvirt though, the assumption would be that a single iommufd object can be used across subsystems, so libvirt would never automatically create multiple objects.

We also need to be able to advise libvirt as to how each iommufd object or user of that object factors into the VM locked memory requirement. When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt to set the locked memory limit to the size of VM RAM per iommufd, regardless of the number of devices using a given iommufd. However, I don't know if all users of iommufd will be exclusively mapping VM RAM. Combinations of devices where some map VM RAM and others map QEMU buffer space could still require some incremental increase per device (I'm not sure if vfio-nvme is such a device). It seems like heuristics will still be involved even after iommufd solves the per-device vfio-pci locked memory limit issue. Thanks, Alex
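For concreteness, a back-of-envelope version of that heuristic (illustrative shell arithmetic, not libvirt code; the per-device increment for non-RAM mappings is just a placeholder):

  VM_RAM_KB=$((16 * 1024 * 1024))   # example: 16 GiB guest
  N_LEGACY_VFIO=1                   # devices still using the type1 backend
  N_IOMMUFD=1                       # iommufd objects, shared across devices
  EXTRA_KB=0                        # per-device buffer mappings, if any
  LIMIT_KB=$(( VM_RAM_KB * (N_LEGACY_VFIO + N_IOMMUFD) + EXTRA_KB ))
  echo "suggested locked memory limit: ${LIMIT_KB} KiB"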