On 10/6/2022 12:01 PM, Jason Gunthorpe wrote:
> On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
>> On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
>>
>>>> If Steve wants to keep it then someone needs to fix the deadlock in
>>>> the vfio implementation before any userspace starts to appear.
>>>
>>> The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned
>>> accounting across exec, which can result in mm->locked_vm becoming
>>> negative. I have several fixes, but none result in limits being reached
>>> at exactly the same time as before -- the same general issue being
>>> discussed for iommufd. I am still thinking about it.
>>
>> Oh, yeah, I noticed this was all busted up too.
>>
>>> I am not aware of a deadlock problem. Please elaborate or point me to an
>>> email thread.
>>
>> VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where
>> userspace can trigger the lock to be taken and then returns to
>> userspace with the lock held.
>>
>> Any scenario where a kernel thread hits that open-coded lock and then
>> userspace does-the-wrong-thing will deadlock the kernel.
>>
>> For instance consider an mdev driver. We assert
>> VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue
>> and becomes blocked on the now locked lock. Userspace then tries to
>> close the device FD.
>>
>> FD closure will trigger device close and the VFIO core code
>> requirement is that mdev driver device teardown must halt all
>> concurrent threads touching vfio_device. Thus the mdev will try to
>> fence its workqueue and then deadlock - unable to flush/cancel a work
>> that is currently blocked on a lock held by userspace that will never
>> be unlocked.
>>
>> This is just the first scenario that comes to mind. The approach to
>> give userspace control of a lock that kernel threads can become
>> blocked on is so completely sketchy it is a complete no-go in my
>> opinion. If I had seen it when it was posted I would have hard NAK'd
>> it.
>>
>> My "full" solution in mind for iommufd is to pin all the memory upon
>> VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue to satisfy DMA requests
>> while the mm_struct is not available. But IMHO this is basically
>> useless for any actual user of mdevs.
>>
>> The other option is to just exclude mdevs and fail the
>> VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from
>> becoming present while it is asserted. In this way we don't need to do
>> anything beyond a simple check as the iommu_domain is already fully
>> populated and pinned.
> Do we have a solution to this?
>
> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR.
>
> Aside from the approach to use the FD, another idea is to just use
> fork.
>
> qemu would do something like
>
>   .. stop all container ioctl activity ..
>   fork()
>      ioctl(CHANGE_MM) // switch all maps to this mm
>      .. signal parent ..
>      .. wait parent ..
>      exit(0)
>   .. wait child ..
>   exec()
>   ioctl(CHANGE_MM) // switch all maps to this mm
>   .. signal child ..
>   waitpid(childpid)
>
> This way the kernel is never left without a page provider for the
> maps, the dummy mm_struct belonging to the fork will serve that role
> for the gap.
>
> And the above is only required if we have mdevs, so we could imagine
> userspace optimizing it away for, eg vfio-pci only cases.
>
> It is not as efficient as using a FD backing, but this is super easy
> to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows. Currently, an
mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
is asserted.

  * In vfio_wait, I will maintain a list of waiters, each list element
    consisting of (task, mdev, close_flag=false).

  * When the vfio device descriptor is closed, vfio_device_fops_release
    will notify the vfio_iommu driver, which will find the mdev on the waiters
    list, set elem->close_flag=true, and call wake_up_process for the task.

  * The task will wake in vfio_wait, see close_flag=true, and return EFAULT
    to the mdev caller.

This requires a little new plumbing. I will work out the details, but if you
see a problem with the overall approach, please let me know.
- Steve