On Fri, Sep 23, 2022 at 11:40:51AM -0400, Laine Stump wrote:
> It's been a few years, but my recollection is that before starting a
> libvirtd that will run a guest with a vfio device, a privileged process
> needs to
> 1) increase the locked memory limit for the user that will be running qemu
>    (eg. by adding a file with the increased limit to /etc/security/limits.d)
> 2) bind the device to the vfio-pci driver, and
> 3) chown /dev/vfio/$iommu_group to the user running qemu.
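
For readers who have not done this by hand, the three privileged steps in the quoted message can be sketched roughly as below. This is only an illustration and not what libvirt does. The function name, the limits file name, the example PCI address and group number are all made up, and "unlimited" stands in for whatever increased limit is actually wanted. The function only builds a plan and executes nothing.

```python
from pathlib import PurePosixPath


def plan_vfio_setup(user, bdf, iommu_group, limits_name="90-vfio-memlock.conf"):
    """Return the ordered (path, action, value) steps; nothing is executed."""
    dev = PurePosixPath("/sys/bus/pci/devices") / bdf
    return [
        # 1) raise the locked memory limit for the qemu user (pam_limits syntax:
        #    <domain> <type> <item> <value>, "-" sets both soft and hard)
        (f"/etc/security/limits.d/{limits_name}", "write",
         f"{user} - memlock unlimited\n"),
        # 2) rebind the device to vfio-pci through sysfs; the unbind step
        #    only applies if some driver currently owns the device
        (str(dev / "driver_override"), "write", "vfio-pci"),
        (str(dev / "driver" / "unbind"), "write", bdf),
        ("/sys/bus/pci/drivers_probe", "write", bdf),
        # 3) hand the group character device to the user running qemu
        (f"/dev/vfio/{iommu_group}", "chown", user),
    ]


if __name__ == "__main__":
    # Hypothetical user, PCI address and IOMMU group, for illustration only.
    for step in plan_vfio_setup("qemu", "0000:01:00.0", 12):
        print(step)
```

A real tool would look up the group number by reading the device's iommu_group link in sysfs instead of taking it as an argument.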
Here is what is going on to resolve this:
1) iommufd internally supports two ways to account ulimits, the vfio
way and the io_uring way. Each FD operates in its own mode.
When /dev/iommu is opened, the FD defaults to the io_uring way; when
/dev/vfio/vfio is opened, it uses the VFIO way. This means
/dev/vfio/vfio is not a symlink; there is now a new kconfig option
that makes iommufd directly provide a miscdev for it.
2) There is an ioctl IOMMU_OPTION_RLIMIT_MODE which allows a
privileged user to query/set which mode the FD will run in.
The idea is that libvirt will open iommufd and, as its first action,
set vfio compat mode. It will then pass the fd to qemu, and qemu
will operate in the correct sandbox.
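
To make the "configure first, then pass the fd" ordering concrete, here is a minimal sketch using SCM_RIGHTS over a Unix socket pair. It is an assumption-laden illustration, not libvirt or qemu code. A temporary file stands in for the iommufd, the `configure` callback is a hypothetical placeholder for the privileged mode-setting ioctl, and the function names are invented. It needs Python 3.9 or later for `socket.send_fds`.

```python
import os
import socket
import tempfile


def configure_then_pass(fd, configure, sock):
    """Run the privileged configuration step on fd, then send fd over sock."""
    configure(fd)  # placeholder for setting the rlimit mode on the fd
    socket.send_fds(sock, [b"iommufd"], [fd])


def receive_fd(sock):
    """Receive one fd and its label on the unprivileged side."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 64, 1)
    return msg, fds[0]


def demo():
    manager, worker = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as f:  # stand-in for an opened /dev/iommu
        configured = []
        configure_then_pass(f.fileno(), configured.append, manager)
        msg, received = receive_fd(worker)
        # The receiver holds a new descriptor for the same open file.
        same_file = os.fstat(received).st_ino == os.fstat(f.fileno()).st_ino
        os.close(received)
    manager.close()
    worker.close()
    return msg, same_file, len(configured)


if __name__ == "__main__":
    print(demo())
```

The point of the ordering is that the receiver never needs the privilege to change the mode; it just gets an fd that is already in the mode the sender chose.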
3) We are working on a cgroup for FOLL_LONGTERM. It is a big job, but
it should provide a comprehensive resolution to this problem across
the kernel and improve the qemu sandbox security.
Still TBD, but most likely, once the cgroup supports this, libvirt
would set the rlimit to unlimited and then set new mlock and
FOLL_LONGTERM cgroup limits to create the sandbox.
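
The rlimit half of that plan can already be sketched today; the cgroup half cannot, because the interface is still TBD. The helper names below are made up. The sketch only raises the soft RLIMIT_MEMLOCK to the current hard limit, which needs no privilege. Going to unlimited would pass `resource.RLIM_INFINITY` for both values and would normally need CAP_SYS_RESOURCE.

```python
import resource


def raise_memlock_soft_limit():
    """Raise the soft RLIMIT_MEMLOCK to the hard limit; return (soft, hard)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    # Raising the soft limit up to the hard limit is always permitted.
    # A privileged caller wanting "unlimited" would instead pass
    # (resource.RLIM_INFINITY, resource.RLIM_INFINITY).
    resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit):
    """Render an rlimit value for humans."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = raise_memlock_soft_limit()
    print("memlock soft:", describe(soft), "hard:", describe(hard))
```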
Jason