On 9/21/2022 2:44 PM, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
>> I still think the compat gaps are small. I've realized that
>> VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it
>> can deadlock the kernel I propose we purge it completely.
>
> Steve won't be happy to hear that, QEMU support exists but isn't yet
> merged.
"unhappy" barely scratches the surface of my feelings!
Live update is a great feature that solves a real problem, and lots of
people have spent lots of time providing thorough feedback on the qemu
patches I have submitted. We *will* cross the finish line. In the
mean time, I maintain a patched qemu for use in my company, and I have
heard from others who do the same.
If Steve wants to keep it then someone needs to fix the deadlock in
the vfio implementation before any userspace starts to appear.
The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting
across exec, which can result in mm->locked_vm becoming negative. I have several
fixes, but none result in limits being reached at exactly the same time as before --
the same general issue being discussed for iommufd. I am still thinking about it.
I am not aware of a deadlock problem. Please elaborate or point me to an
email thread.
I can fix the deadlock in iommufd in a terrible expensive way, but
would rather we design a better interface if nobody is using it yet. I
advocate for passing the memfd to the kernel and use that as the page
provider, not a mm_struct.
memfd support alone is not sufficient. Live update also supports guest ram
backed by named shared memory.
- Steve
> The issue is where we account these pinned pages, where
accounting is
> necessary such that a user cannot lock an arbitrary number of pages
> into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually
work because a *task* limit is trivially bypassed by just spawning
more tasks. So, as a security feature, this is already very
questionable.
What we've done here is make the security feature work to actually
prevent DOS attacks, which then gives you this problem:
> This obviously has implications. AFAICT, any management tool that
> doesn't instantiate assigned device VMs under separate users are
> essentially untenable.
Because now that the security feature works properly it detects the
DOS created by spawning multiple tasks :(
Somehow I was under the impression there was not user sharing in the
common cases, but I guess I don't know that for sure.
>> So, I still like 2 because it yields the smallest next step before we
>> can bring all the parallel work onto the list, and it makes testing
>> and converting non-qemu stuff easier even going forward.
>
> If a vfio compatible interface isn't transparently compatible, then I
> have a hard time understanding its value. Please correct my above
> description and implications, but I suspect these are not just
> theoretical ABI compat issues. Thanks,
Because it is just fine for everything that doesn't use the ulimit
feature, which is still a lot of use cases!
Remember, at this point we are not replacing /dev/vfio/vfio, this is
just providing the general compat in a form that has to be opted
into. I think if you open the /dev/iommu device node then you should
get secured accounting.
If /dev/vfio/vfio is provided by iommufd it may well have to trigger a
different ulimit tracking - if that is the only sticking point it
seems minor and should be addressed in some later series that adds
/dev/vfio/vfio support to iommufd..
Jason