Re: [RFC 00/18] vfio: Adopt iommufd

25 Apr 2022

On Mon, 25 Apr 2022 22:23:05 +0200
Eric Auger <eric.auger@redhat.com> wrote:
...
Hi Alex,
On 4/23/22 12:09 AM, Alex Williamson wrote:
...
[Cc +libvirt folks]
On Thu, 14 Apr 2022 03:46:52 -0700
Yi Liu <yi.l.liu@intel.com> wrote:
...
With the introduction of iommufd[1], the Linux kernel provides a generic
interface for userspace drivers to propagate their DMA mappings to the kernel
for assigned devices. This series ports the VFIO devices
onto the /dev/iommu uapi and lets them coexist with the legacy implementation.
Other devices such as vdpa, vfio mdev, etc. are not considered yet.
For vfio devices, the new interface is tied with device fd and iommufd
as the iommufd solution is device-centric. This is different from legacy
vfio which is group-centric. To support both interfaces in QEMU, this
series introduces the iommu backend concept in the form of different
container classes. The existing vfio container is named the legacy container
(equivalent to the legacy iommu backend in this series), while the new
iommufd-based container is named the iommufd container (also referred to
as the iommufd backend in this series). The two backend types have their own
ways to set up a secure context and their own DMA management interfaces. The
diagram below shows how this looks with both BEs.
VFIO                           AddressSpace/Memory
    +-------+  +----------+  +-----+  +-----+
    |  pci  |  | platform |  |  ap |  | ccw |
    +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
        |           |           |        |        |   AddressSpace       |
        |           |           |        |        +------------+---------+
    +---V-----------V-----------V--------V----+               /
    |           VFIOAddressSpace              | <------------+
    |                  |                      |  MemoryListener
    |          VFIOContainer list             |
    +-------+----------------------------+----+
            |                            |
            |                            |
    +-------V------+            +--------V----------+
    |   iommufd    |            |    vfio legacy    |
    |  container   |            |     container     |
    +-------+------+            +--------+----------+
            |                            |
            | /dev/iommu                 | /dev/vfio/vfio
            | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
 Userspace  |                            |
 ===========+============================+================================
 Kernel     |  device fd                 |
            +---------------+            | group/container fd
            | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
            |  ATTACH_IOAS) |            | device fd
            |               |            |
            |       +-------V------------V-----------------+
    iommufd |       |                vfio                  |
(map/unmap  |       +---------+--------------------+-------+
 ioas_copy) |                 |                    | map/unmap
            |                 |                    |
     +------V------+    +-----V------+      +------V--------+
     | iommufd core|    |  device    |      |  vfio iommu   |
     +-------------+    +------------+      +---------------+
[Secure Context setup]
- iommufd BE: uses device fd and iommufd to set up the secure context
              (bind_iommufd, attach_ioas)
- vfio legacy BE: uses group fd and container fd to set up the secure context
                  (set_container, set_iommu)
[Device access]
- iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: device fd is retrieved from group fd ioctl
[DMA Mapping flow]
- VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
- VFIO populates DMA map/unmap via the container BEs
  *) iommufd BE: uses iommufd
  *) vfio legacy BE: uses container fd
This series qomifies the VFIOContainer object which acts as a base class
for a container. This base class is derived into the legacy VFIO container
and the new iommufd based container. The base class implements generic code
such as code related to memory_listener and address space management whereas
the derived classes implement the callbacks that depend on the kernel
userspace interface being used.
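
To make the split between generic and backend-specific code concrete, here is a rough Python model of the idea. It is not QEMU code: the class and method names (Container, region_add, dma_map, ...) are hypothetical, and the backends only record calls instead of issuing ioctls.

```python
from abc import ABC, abstractmethod


class Container(ABC):
    """Hypothetical base class holding the generic, backend-agnostic code."""

    def __init__(self):
        self.mappings = {}  # iova -> size, tracked by the generic code
        self.calls = []     # record of backend calls, for illustration only

    # Generic part: what a memory listener would invoke on region add/del.
    def region_add(self, iova, size):
        self.dma_map(iova, size)
        self.mappings[iova] = size

    def region_del(self, iova):
        size = self.mappings.pop(iova)
        self.dma_unmap(iova, size)

    # Backend-specific callbacks, supplied by the derived classes.
    @abstractmethod
    def dma_map(self, iova, size):
        ...

    @abstractmethod
    def dma_unmap(self, iova, size):
        ...


class LegacyContainer(Container):
    """Stand-in for the legacy backend (would use the container fd)."""

    def dma_map(self, iova, size):
        self.calls.append(("legacy map", iova, size))

    def dma_unmap(self, iova, size):
        self.calls.append(("legacy unmap", iova, size))


class IommufdContainer(Container):
    """Stand-in for the iommufd backend (would use the iommufd)."""

    def dma_map(self, iova, size):
        self.calls.append(("iommufd map", iova, size))

    def dma_unmap(self, iova, size):
        self.calls.append(("iommufd unmap", iova, size))
```

The listener-facing methods live only in the base class, so adding a backend means implementing just the two callbacks.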
The selection of the backend is made on a device basis using the new
iommufd option (on/off/auto). By default the iommufd backend is selected
if supported by the host and by QEMU (iommufd KConfig). This option is
currently available only for the vfio-pci device. For other types of
devices, it does not yet exist and the legacy BE is chosen by default.  
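
The on/off/auto selection described above could be modelled roughly as follows. This is a sketch, not the series' implementation: the function name and arguments are hypothetical, and failing hard for "on" when iommufd is unavailable is an assumption the text does not state.

```python
def select_backend(option, host_has_iommufd, qemu_has_iommufd):
    """Return "iommufd" or "legacy" for a per-device iommufd option."""
    available = host_has_iommufd and qemu_has_iommufd
    if option == "off":
        return "legacy"
    if option == "on":
        # Assumption: an explicit request fails rather than falling back.
        if not available:
            raise RuntimeError("iommufd requested but not available")
        return "iommufd"
    if option == "auto":
        # Prefer iommufd when both host and QEMU support it.
        return "iommufd" if available else "legacy"
    raise ValueError("unknown iommufd option: %r" % (option,))
```

Note that with "auto" the result depends on the host, so a caller cannot tell from the command line alone which backend ends up in use.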
I've discussed this a bit with Eric, but let me propose a different
command line interface.  Libvirt generally likes to pass file
descriptors to QEMU rather than grant it access to those files
directly.  This was problematic with vfio-pci because libvirt can't
easily know when QEMU will want to grab another /dev/vfio/vfio
container.  Therefore we abandoned this approach and instead libvirt
grants file permissions.
However, with iommufd there's no reason that QEMU ever needs more than
a single instance of /dev/iommufd and we're using per device vfio file
descriptors, so it seems like a good time to revisit this.
The interface I was considering would be to add an iommufd object to
QEMU, so we might have a:
-device iommufd[,fd=#][,id=foo]
For non-libvirt usage this would have the ability to open /dev/iommufd
itself if an fd is not provided.  This object could be shared with
other iommufd users in the VM and maybe we'd allow multiple instances
for more esoteric use cases.  [NB, maybe this should be a -object rather than
-device since the iommufd is not a guest visible device?]
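
A minimal sketch of the "use the passed fd, otherwise open the device file yourself" behaviour, in Python for brevity. The helper name acquire_iommufd and its return convention are hypothetical; the device path is left to the caller.

```python
import os


def acquire_iommufd(path, fd=None):
    """Return (fd, owned): a usable descriptor and whether we opened it."""
    if fd is not None:
        # Descriptor handed in by a management tool: check it is open,
        # but do not take ownership of it.
        os.fstat(fd)  # raises OSError if fd is not a valid descriptor
        return fd, False
    # No fd given: open the device file ourselves.
    return os.open(path, os.O_RDWR), True
```

The "owned" flag is only there so the caller knows whether it is responsible for closing the descriptor.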
The vfio-pci device might then become:
-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
So essentially we can specify the device via host, sysfsdev, or passing
an fd to the vfio device file.  When an iommufd object is specified,
"foo" in the example above, each of those options would use the
vfio-device access mechanism, essentially the same as iommufd=on in
your example.  With the fd passing option, an iommufd object would be
required and necessarily use device level access.  
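
The option rules in this proposal can be written down as a small check. This is only a Python model of the rules as stated, with a hypothetical function name; requiring exactly one of host, sysfsdev and fd is an assumption, and "group" stands for the legacy access path when no iommufd object is given.

```python
def resolve_vfio_pci(host=None, sysfsdev=None, fd=None, iommufd=None):
    """Return (how the device was named, "device" or "group" access)."""
    options = (("host", host), ("sysfsdev", sysfsdev), ("fd", fd))
    given = [name for name, value in options if value is not None]
    # Assumption: exactly one way of naming the device is allowed.
    if len(given) != 1:
        raise ValueError("specify exactly one of host, sysfsdev, fd")
    # A passed device fd only makes sense with device-level access.
    if fd is not None and iommufd is None:
        raise ValueError("fd= requires an iommufd object")
    access = "device" if iommufd is not None else "group"
    return given[0], access
```

The backend follows directly from whether iommufd= is present, so a tool building the command line always knows which one is in use.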
What is the use case you foresee for the "fd=#" option?
On the vfio-pci device this was intended to be the actual vfio device
file descriptor.  Once we have a file per device, QEMU doesn't really
have any need to navigate through sysfs to determine which fd to use
other than for user convenience on the command line.  For libvirt usage,
I assume QEMU could accept the device fd, without ever really knowing
anything about the host address or sysfs path of the device.
...
...
In your example, the iommufd=auto seems especially troublesome for
libvirt because QEMU is going to have different locked memory
requirements based on whether we're using type1 or iommufd, where the
latter resolves the duplicate accounting issues.  libvirt needs to know
deterministically which backend is being used, which this proposal seems
to provide, while at the same time bringing us more in line with fd
passing.  Thoughts?  Thanks,
I like your proposal (based on the -object iommufd). The only drawback,
I think, is that it adds some extra complexity for a QEMU end-user who
does not care which iommu backend is used and just wishes to use the most
recent one available. But this is not the most important use case ;)
Yeah, I can sympathize with that, but isn't that also why we're pursuing
a vfio compatibility interface at the kernel level?  Eventually, once
the native vfio IOMMU backends go away, the vfio "container" device
file will be provided by iommufd and that transition to the new
interface can be both seamless to the user and apparent to tools like
libvirt.

An end-user with a fixed command line should continue to work and will
eventually get iommufd via compatibility, but taking care of an
end-user that "does not care" and "wishes to use the most recent" is a
non-goal for me.  That would be more troublesome for tools and use cases
that we do care about imo.  Thanks,

Alex