On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
>> Do we have a solution to this?
>>
>> If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
>>
>> Aside from the approach to use the FD, another idea is to just use
>> fork.
>>
>> qemu would do something like
>>
>> .. stop all container ioctl activity ..
>> fork()
>> ioctl(CHANGE_MM) // switch all maps to this mm
>> .. signal parent..
>> .. wait parent..
>> exit(0)
>> .. wait child ..
>> exec()
>> ioctl(CHANGE_MM) // switch all maps to this mm
>> ..signal child..
>> waitpid(childpid)
>>
>> This way the kernel is never left without a page provider for the
>> maps, the dummy mm_struct belonging to the fork will serve that role
>> for the gap.
>>
>> And the above is only required if we have mdevs, so we could imagine
>> userspace optimizing it away for, eg vfio-pci only cases.
>>
>> It is not as efficient as using a FD backing, but this is super easy
>> to implement in the kernel.
>
> I propose to avoid deadlock for mediated devices as follows. Currently, an
> mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR
> is asserted.
>
> * In vfio_wait, I will maintain a list of waiters, each list element
> consisting of (task, mdev, close_flag=false).
>
> * When the vfio device descriptor is closed, vfio_device_fops_release
> will notify the vfio_iommu driver, which will find the mdev on the waiters
> list, set elem->close_flag=true, and call wake_up_process for the task.
This alone is not sufficient, the mdev driver can continue to
establish new mappings until it's close_device function
returns. Killing only existing mappings is racy.
I think you are focusing on the one issue I pointed at, as I said, I'm
sure there are more ways than just close to abuse this functionality
to deadlock the kernel.
I continue to prefer we remove it completely and do something more
robust. I suggested two options.
It's not racy. New pin requests also land in vfio_wait if any vaddr's have
been invalidated in any vfio_dma in the iommu. See
vfio_iommu_type1_pin_pages()
if (iommu->vaddr_invalid_count)
vfio_find_dma_valid()
vfio_wait()
However, I will investigate saving a reference to the file object in the vfio_dma
(for mappings backed by a file) and using that to translate IOVA's. I think that
will be easier to use than fork/CHANGE_MM/exec, and may even be easier to use
than VFIO_DMA_UNMAP_FLAG_VADDR. To be continued.
- Steve