
On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
Do we have a solution to this?
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use fork.
qemu would do something like
.. stop all container ioctl activity .. fork() ioctl(CHANGE_MM) // switch all maps to this mm .. signal parent.. .. wait parent.. exit(0) .. wait child .. exec() ioctl(CHANGE_MM) // switch all maps to this mm ..signal child.. waitpid(childpid)
This way the kernel is never left without a page provider for the maps, the dummy mm_struct belonging to the fork will serve that role for the gap.
And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows. Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR is asserted.
* In vfio_wait, I will maintain a list of waiters, each list element consisting of (task, mdev, close_flag=false).
* When the vfio device descriptor is closed, vfio_device_fops_release will notify the vfio_iommu driver, which will find the mdev on the waiters list, set elem->close_flag=true, and call wake_up_process for the task.
This alone is not sufficient, the mdev driver can continue to establish new mappings until it's close_device function returns. Killing only existing mappings is racy.
I think you are focusing on the one issue I pointed at, as I said, I'm sure there are more ways than just close to abuse this functionality to deadlock the kernel.
I continue to prefer we remove it completely and do something more robust. I suggested two options.
It's not racy. New pin requests also land in vfio_wait if any vaddr's have been invalidated in any vfio_dma in the iommu. See vfio_iommu_type1_pin_pages() if (iommu->vaddr_invalid_count) vfio_find_dma_valid() vfio_wait()
I mean you can't do a one shot wakeup of only existing waiters, and you can't corrupt the container to wake up waiters for other devices, so I don't see how this can be made to work safely... It also doesn't solve any flow that doesn't trigger file close, like a process thread being stuck on the wait in the kernel. eg because a trapped mmio triggered an access or something. So it doesn't seem like a workable direction to me.
However, I will investigate saving a reference to the file object in the vfio_dma (for mappings backed by a file) and using that to translate IOVA's.
It is certainly the best flow, but it may be difficult. Eg the memfd work for KVM to do something similar is quite involved.
I think that will be easier to use than fork/CHANGE_MM/exec, and may even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR. To be continued.
Yes, certainly easier to use, I suggested CHANGE_MM because the kernel implementation is very easy, I could send you something to test w/ iommufd in a few hours effort probably. Anyhow, I think this conversation has convinced me there is no way to fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to it being a security bug, basically. Jason