Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface

12 Oct 2022

      On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
...
On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
...
On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
...
...
Do we have a solution to this?
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use
fork.
qemu would do something like
.. stop all container ioctl activity ..
 fork()
    ioctl(CHANGE_MM) // switch all maps to this mm
    .. signal parent.. 
    .. wait parent..
    exit(0)
 .. wait child ..
 exec()
 ioctl(CHANGE_MM) // switch all maps to this mm
 ..signal child..
 waitpid(childpid)
This way the kernel is never left without a page provider for the
maps, the dummy mm_struct belonging to the fork will serve that role
for the gap.
And the above is only required if we have mdevs, so we could imagine
userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy
to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows.  Currently, an
mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
is asserted.
* In vfio_wait, I will maintain a list of waiters, each list element
    consisting of (task, mdev, close_flag=false).
* When the vfio device descriptor is closed, vfio_device_fops_release
    will notify the vfio_iommu driver, which will find the mdev on the waiters
    list, set elem->close_flag=true, and call wake_up_process for the task.
This alone is not sufficient, the mdev driver can continue to
establish new mappings until it's close_device function
returns. Killing only existing mappings is racy.
I think you are focusing on the one issue I pointed at, as I said, I'm
sure there are more ways than just close to abuse this functionality
to deadlock the kernel.
I continue to prefer we remove it completely and do something more
robust. I suggested two options.
It's not racy.  New pin requests also land in vfio_wait if any vaddr's have
been invalidated in any vfio_dma in the iommu.  See
  vfio_iommu_type1_pin_pages()
    if (iommu->vaddr_invalid_count)
      vfio_find_dma_valid()
        vfio_wait()
I mean you can't do a one shot wakeup of only existing waiters, and
you can't corrupt the container to wake up waiters for other devices,
so I don't see how this can be made to work safely...

It also doesn't solve any flow that doesn't trigger file close, like a
process thread being stuck on the wait in the kernel. eg because a
trapped mmio triggered an access or something.

So it doesn't seem like a workable direction to me.
...
However, I will investigate saving a reference to the file object in
the vfio_dma (for mappings backed by a file) and using that to
translate IOVA's.
It is certainly the best flow, but it may be difficult. Eg the memfd
work for KVM to do something similar is quite involved.
...
I think that will be easier to use than fork/CHANGE_MM/exec, and may
even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR.  To be
continued.
Yes, certainly easier to use, I suggested CHANGE_MM because the kernel
implementation is very easy, I could send you something to test w/
iommufd in a few hours effort probably.

Anyhow, I think this conversation has convinced me there is no way to
fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to
it being a security bug, basically.

Jason