Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface

[Cc+ Steve, libvirt, Daniel, Laine] On Tue, 20 Sep 2022 16:56:42 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
Hi,
On 9/13/22 03:55, Tian, Kevin wrote:
We didn't close the open question of how to get this merged at LPC due to the audio issue, so let's continue by mail.
Overall there are three options on the table:
1) Require vfio-compat to be 100% compatible with vfio-type1
Probably not a good choice given the amount of work to fix the remaining gaps. And this will block support of new IOMMU features for a longer time.
2) Leave vfio-compat as what it is in this series
Treat it as a vehicle to validate the iommufd logic instead of immediately replacing vfio-type1. Functionally most vfio applications can work w/o change, putting aside the differences in locked memory accounting, p2p, etc.
Then work on new features and 100% vfio-type1 compat. in parallel.
3) Focus on iommufd native uAPI first
Require the vfio_device cdev and its adoption in QEMU. Only for new vfio apps.
Then work on new features and vfio-compat in parallel.
I'm fine with either 2) or 3). Per a quick chat with Alex, he prefers 3).
I am also inclined to pursue 3) as this was Jason's initial guidance and a pre-requisite for integrating new features. In the past we concluded vfio-compat would mostly be used for testing purposes. Our QEMU integration is fully based on the device-based API.
There are some poor chicken and egg problems here.
I had some assumptions:
 a - the vfio cdev model is going to be iommufd only
 b - any uAPI we add as we go along should be generally useful going forward
 c - we should try to minimize the 'minimally viable iommufd' series
The compat as it stands now (eg #2) is threading this needle. Since it can exist without cdev it means (c) is made smaller, to two series.
Since we add something useful to some use cases, eg DPDK is deployable that way, (b) is OK.
If we focus on a strict path with 3, and avoid adding non-useful code, then we have to have two more (unwritten!) series beyond where we are now - vfio group compartmentalization, and cdev integration, and the initial (c) will increase.
3 also has us merging something that currently has no usable userspace, which I also dislike a lot.
I still think the compat gaps are small. I've realized that VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it can deadlock the kernel I propose we purge it completely.
Steve won't be happy to hear that, QEMU support exists but isn't yet merged.
P2P is ongoing.
That really just leaves the accounting, and I'm still not convinced that this must be a critical thing. Linus's latest remarks on tracepoints/BPF as ABI, reported on LWN from the maintainers summit, seem to support this. Let's see an actual deployed production configuration that would be impacted, and we won't find that unless we move forward.
I'll try to summarize the proposed change so that we can get better advice from libvirt folks, or potentially anyone else managing locked memory limits for device assignment VMs.

Background: when a DMA range, ex. guest RAM, is mapped to a vfio device, we use the system IOMMU to provide GPA to HPA translation for assigned devices. Unlike CPU page tables, we don't generally have a means to demand fault these translations, therefore the memory target of the translation is pinned so that it cannot be swapped or relocated, ie. to guarantee the translation is always valid.

The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack. Duplicate accounting should be resolved by iommufd, but is outside the scope of this discussion.

Currently, vfio tests against mm_struct.locked_vm relative to rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur, where task is the current process. This is the same limit set via the setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'. Note that in both cases above, we're dealing with a task, or process, limit, and both the prlimit and ulimit man pages describe them as such.

iommufd supposes instead, and references existing kernel implementations, that despite the descriptions above these limits are actually meant to be user limits, and therefore instead charges pinned pages against user_struct.locked_vm and also marks them in mm_struct.pinned_vm.

The proposed algorithm is to read the _task_ locked memory limit, then attempt to charge the _user_ locked_vm, such that user_struct.locked_vm cannot exceed the task locked memory limit.

This obviously has implications. AFAICT, any management tool that doesn't instantiate assigned device VMs under separate users is essentially untenable.
For example, if we launch VM1 under userA and set a locked memory limit of 4GB via prlimit to account for an assigned device, that works fine, until we launch VM2 from userA as well. In that case we can't simply set a 4GB limit on the VM2 task because there's already 4GB charged against user_struct.locked_vm for VM1. So we'd need to set the VM2 task limit to 8GB to be able to launch VM2. But not only that, we'd need to go back and also set VM1's task limit to 8GB or else it will fail if a DMA mapped memory region is transient and needs to be re-mapped.

Effectively any task under the same user and requiring pinned memory needs to have a locked memory limit set, and updated, to account for all tasks using pinned memory by that user.

How does this affect known current use cases of locked memory management for assigned device VMs? Does qemu://system by default sandbox into per-VM uids, or do they all use the qemu user by default? I imagine qemu://session mode is pretty screwed by this, but I also don't know who/where locked limits are lifted for such VMs. Boxes, which I think now supports assigned device VMs, could also be affected.
So, I still like 2 because it yields the smallest next step before we can bring all the parallel work onto the list, and it makes testing and converting non-qemu stuff easier even going forward.
If a vfio compatible interface isn't transparently compatible, then I have a hard time understanding its value. Please correct my above description and implications, but I suspect these are not just theoretical ABI compat issues. Thanks, Alex

On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
I still think the compat gaps are small. I've realized that VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it can deadlock the kernel I propose we purge it completely.
Steve won't be happy to hear that, QEMU support exists but isn't yet merged.
If Steve wants to keep it then someone needs to fix the deadlock in the vfio implementation before any userspace starts to appear. I can fix the deadlock in iommufd in a terrible expensive way, but would rather we design a better interface if nobody is using it yet. I advocate for passing the memfd to the kernel and use that as the page provider, not a mm_struct.
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable. What we've done here is make the security feature work to actually prevent DOS attacks, which then gives you this problem:
This obviously has implications. AFAICT, any management tool that doesn't instantiate assigned device VMs under separate users is essentially untenable.
Because now that the security feature works properly it detects the DOS created by spawning multiple tasks :( Somehow I was under the impression there was not user sharing in the common cases, but I guess I don't know that for sure.
So, I still like 2 because it yields the smallest next step before we can bring all the parallel work onto the list, and it makes testing and converting non-qemu stuff easier even going forward.
If a vfio compatible interface isn't transparently compatible, then I have a hard time understanding its value. Please correct my above description and implications, but I suspect these are not just theoretical ABI compat issues. Thanks,
Because it is just fine for everything that doesn't use the ulimit feature, which is still a lot of use cases!

Remember, at this point we are not replacing /dev/vfio/vfio, this is just providing the general compat in a form that has to be opted into. I think if you open the /dev/iommu device node then you should get secured accounting.

If /dev/vfio/vfio is provided by iommufd it may well have to trigger a different ulimit tracking - if that is the only sticking point it seems minor and should be addressed in some later series that adds /dev/vfio/vfio support to iommufd.

Jason

On 9/21/2022 2:44 PM, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
I still think the compat gaps are small. I've realized that VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it can deadlock the kernel I propose we purge it completely.
Steve won't be happy to hear that, QEMU support exists but isn't yet merged.
"unhappy" barely scratches the surface of my feelings! Live update is a great feature that solves a real problem, and lots of people have spent lots of time providing thorough feedback on the qemu patches I have submitted. We *will* cross the finish line. In the mean time, I maintain a patched qemu for use in my company, and I have heard from others who do the same.
If Steve wants to keep it then someone needs to fix the deadlock in the vfio implementation before any userspace starts to appear.
The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting across exec, which can result in mm->locked_vm becoming negative. I have several fixes, but none result in limits being reached at exactly the same time as before -- the same general issue being discussed for iommufd. I am still thinking about it. I am not aware of a deadlock problem. Please elaborate or point me to an email thread.
I can fix the deadlock in iommufd in a terrible expensive way, but would rather we design a better interface if nobody is using it yet. I advocate for passing the memfd to the kernel and use that as the page provider, not a mm_struct.
memfd support alone is not sufficient. Live update also supports guest ram backed by named shared memory. - Steve
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
What we've done here is make the security feature work to actually prevent DOS attacks, which then gives you this problem:
This obviously has implications. AFAICT, any management tool that doesn't instantiate assigned device VMs under separate users is essentially untenable.
Because now that the security feature works properly it detects the DOS created by spawning multiple tasks :(
Somehow I was under the impression there was not user sharing in the common cases, but I guess I don't know that for sure.
So, I still like 2 because it yields the smallest next step before we can bring all the parallel work onto the list, and it makes testing and converting non-qemu stuff easier even going forward.
If a vfio compatible interface isn't transparently compatible, then I have a hard time understanding its value. Please correct my above description and implications, but I suspect these are not just theoretical ABI compat issues. Thanks,
Because it is just fine for everything that doesn't use the ulimit feature, which is still a lot of use cases!
Remember, at this point we are not replacing /dev/vfio/vfio, this is just providing the general compat in a form that has to be opted into. I think if you open the /dev/iommu device node then you should get secured accounting.
If /dev/vfio/vfio is provided by iommufd it may well have to trigger a different ulimit tracking - if that is the only sticking point it seems minor and should be addressed in some later series that adds /dev/vfio/vfio support to iommufd..
Jason

On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
If Steve wants to keep it then someone needs to fix the deadlock in the vfio implementation before any userspace starts to appear.
The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting across exec, which can result in mm->locked_vm becoming negative. I have several fixes, but none result in limits being reached at exactly the same time as before -- the same general issue being discussed for iommufd. I am still thinking about it.
Oh, yeah, I noticed this was all busted up too.
I am not aware of a deadlock problem. Please elaborate or point me to an email thread.
VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where userspace can trigger the lock to be taken and then returns to userspace with the lock held.

Any scenario where a kernel thread hits that open-coded lock and then userspace does-the-wrong-thing will deadlock the kernel.

For instance consider a mdev driver. We assert VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue and becomes blocked on the now-locked lock. Userspace then tries to close the device FD.

FD closure will trigger device close, and the VFIO core code requirement is that mdev driver device teardown must halt all concurrent threads touching vfio_device. Thus the mdev will try to fence its workqueue and then deadlock - unable to flush/cancel a work that is currently blocked on a lock held by userspace that will never be unlocked.

This is just the first scenario that comes to mind. The approach of giving userspace control of a lock that kernel threads can become blocked on is so completely sketchy it is a complete no-go in my opinion. If I had seen it when it was posted I would have hard NAK'd it.

My "full" solution in mind for iommufd is to pin all the memory upon VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue satisfying DMA requests while the mm_struct is not available. But IMHO this is basically useless for any actual user of mdevs.

The other option is to just exclude mdevs and fail the VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from becoming present while it is asserted. In this way we don't need to do anything beyond a simple check, as the iommu_domain is already fully populated and pinned.
I can fix the deadlock in iommufd in a terrible expensive way, but would rather we design a better interface if nobody is using it yet. I advocate for passing the memfd to the kernel and use that as the page provider, not a mm_struct.
memfd support alone is not sufficient. Live update also supports guest ram backed by named shared memory.
By "memfd" I broadly mean whatever FD based storage you want to use: shmem, hugetlbfs, whatever. Just not a mm_struct. The point is we satisfy the page requests through the fd based object, not through a vma in the mm_struct. Jason

On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
If Steve wants to keep it then someone needs to fix the deadlock in the vfio implementation before any userspace starts to appear.
The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting across exec, which can result in mm->locked_vm becoming negative. I have several fixes, but none result in limits being reached at exactly the same time as before -- the same general issue being discussed for iommufd. I am still thinking about it.
Oh, yeah, I noticed this was all busted up too.
I am not aware of a deadlock problem. Please elaborate or point me to an email thread.
VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where userspace can trigger the lock to be taken and then returns to userspace with the lock held.
Any scenario where a kernel thread hits that open-coded lock and then userspace does-the-wrong-thing will deadlock the kernel.
For instance consider a mdev driver. We assert VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue and becomes blocked on the now locked lock. Userspace then tries to close the device FD.
FD closure will trigger device close, and the VFIO core code requirement is that mdev driver device teardown must halt all concurrent threads touching vfio_device. Thus the mdev will try to fence its workqueue and then deadlock - unable to flush/cancel a work that is currently blocked on a lock held by userspace that will never be unlocked.
This is just the first scenario that comes to mind. The approach to give userspace control of a lock that kernel threads can become blocked on is so completely sketchy it is a complete no-go in my opinion. If I had seen it when it was posted I would have hard NAK'd it.
My "full" solution in mind for iommufd is to pin all the memory upon VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue satisfying DMA requests while the mm_struct is not available. But IMHO this is basically useless for any actual user of mdevs.
The other option is to just exclude mdevs and fail the VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from becoming present while it is asserted. In this way we don't need to do anything beyond a simple check as the iommu_domain is already fully populated and pinned.
Do we have a solution to this? If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR.

Aside from the approach to use the FD, another idea is to just use fork. qemu would do something like:

   .. stop all container ioctl activity ..
   fork()
      ioctl(CHANGE_MM) // switch all maps to this mm
      .. signal parent..
      .. wait parent..
      exit(0)
   .. wait child ..
   exec()
   ioctl(CHANGE_MM) // switch all maps to this mm
   ..signal child..
   waitpid(childpid)

This way the kernel is never left without a page provider for the maps; the dummy mm_struct belonging to the fork will serve that role for the gap.

And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg, vfio-pci only cases.

It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.

Jason

On 10/6/2022 12:01 PM, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
If Steve wants to keep it then someone needs to fix the deadlock in the vfio implementation before any userspace starts to appear.
The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting across exec, which can result in mm->locked_vm becoming negative. I have several fixes, but none result in limits being reached at exactly the same time as before -- the same general issue being discussed for iommufd. I am still thinking about it.
Oh, yeah, I noticed this was all busted up too.
I am not aware of a deadlock problem. Please elaborate or point me to an email thread.
VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where userspace can trigger the lock to be taken and then returns to userspace with the lock held.
Any scenario where a kernel thread hits that open-coded lock and then userspace does-the-wrong-thing will deadlock the kernel.
For instance consider a mdev driver. We assert VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue and becomes blocked on the now locked lock. Userspace then tries to close the device FD.
FD closure will trigger device close, and the VFIO core code requirement is that mdev driver device teardown must halt all concurrent threads touching vfio_device. Thus the mdev will try to fence its workqueue and then deadlock - unable to flush/cancel a work that is currently blocked on a lock held by userspace that will never be unlocked.
This is just the first scenario that comes to mind. The approach to give userspace control of a lock that kernel threads can become blocked on is so completely sketchy it is a complete no-go in my opinion. If I had seen it when it was posted I would have hard NAK'd it.
My "full" solution in mind for iommufd is to pin all the memory upon VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue satisfying DMA requests while the mm_struct is not available. But IMHO this is basically useless for any actual user of mdevs.
The other option is to just exclude mdevs and fail the VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from becoming present while it is asserted. In this way we don't need to do anything beyond a simple check as the iommu_domain is already fully populated and pinned.
Do we have a solution to this?
Not yet, but I have not had time until now. Let me try some things tomorrow and get back to you. Thanks for thinking about it. - Steve
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use fork.
qemu would do something like
.. stop all container ioctl activity ..
fork()
   ioctl(CHANGE_MM) // switch all maps to this mm
   .. signal parent..
   .. wait parent..
   exit(0)
.. wait child ..
exec()
ioctl(CHANGE_MM) // switch all maps to this mm
..signal child..
waitpid(childpid)
This way the kernel is never left without a page provider for the maps, the dummy mm_struct belonging to the fork will serve that role for the gap.
And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.
Jason

On 10/6/2022 12:01 PM, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 08:09:54PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 03:30:55PM -0400, Steven Sistare wrote:
If Steve wants to keep it then someone needs to fix the deadlock in the vfio implementation before any userspace starts to appear.
The only VFIO_DMA_UNMAP_FLAG_VADDR issue I am aware of is broken pinned accounting across exec, which can result in mm->locked_vm becoming negative. I have several fixes, but none result in limits being reached at exactly the same time as before -- the same general issue being discussed for iommufd. I am still thinking about it.
Oh, yeah, I noticed this was all busted up too.
I am not aware of a deadlock problem. Please elaborate or point me to an email thread.
VFIO_DMA_UNMAP_FLAG_VADDR open codes a lock in the kernel where userspace can trigger the lock to be taken and then returns to userspace with the lock held.
Any scenario where a kernel thread hits that open-coded lock and then userspace does-the-wrong-thing will deadlock the kernel.
For instance consider a mdev driver. We assert VFIO_DMA_UNMAP_FLAG_VADDR, the mdev driver does a DMA in a workqueue and becomes blocked on the now locked lock. Userspace then tries to close the device FD.
FD closure will trigger device close, and the VFIO core code requirement is that mdev driver device teardown must halt all concurrent threads touching vfio_device. Thus the mdev will try to fence its workqueue and then deadlock - unable to flush/cancel a work that is currently blocked on a lock held by userspace that will never be unlocked.
This is just the first scenario that comes to mind. The approach to give userspace control of a lock that kernel threads can become blocked on is so completely sketchy it is a complete no-go in my opinion. If I had seen it when it was posted I would have hard NAK'd it.
My "full" solution in mind for iommufd is to pin all the memory upon VFIO_DMA_UNMAP_FLAG_VADDR, so we can continue satisfying DMA requests while the mm_struct is not available. But IMHO this is basically useless for any actual user of mdevs.
The other option is to just exclude mdevs and fail the VFIO_DMA_UNMAP_FLAG_VADDR if any are present, then prevent them from becoming present while it is asserted. In this way we don't need to do anything beyond a simple check as the iommu_domain is already fully populated and pinned.
Do we have a solution to this?
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use fork.
qemu would do something like
.. stop all container ioctl activity ..
fork()
   ioctl(CHANGE_MM) // switch all maps to this mm
   .. signal parent..
   .. wait parent..
   exit(0)
.. wait child ..
exec()
ioctl(CHANGE_MM) // switch all maps to this mm
..signal child..
waitpid(childpid)
This way the kernel is never left without a page provider for the maps, the dummy mm_struct belonging to the fork will serve that role for the gap.
And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows. Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR is asserted.

* In vfio_wait, I will maintain a list of waiters, each list element consisting of (task, mdev, close_flag=false).

* When the vfio device descriptor is closed, vfio_device_fops_release will notify the vfio_iommu driver, which will find the mdev on the waiters list, set elem->close_flag=true, and call wake_up_process for the task.

* The task will wake in vfio_wait, see close_flag=true, and return EFAULT to the mdev caller.

This requires a little new plumbing. I will work out the details, but if you see a problem with the overall approach, please let me know.

- Steve

On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
Do we have a solution to this?
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use fork.
qemu would do something like
.. stop all container ioctl activity ..
fork()
   ioctl(CHANGE_MM) // switch all maps to this mm
   .. signal parent..
   .. wait parent..
   exit(0)
.. wait child ..
exec()
ioctl(CHANGE_MM) // switch all maps to this mm
..signal child..
waitpid(childpid)
This way the kernel is never left without a page provider for the maps, the dummy mm_struct belonging to the fork will serve that role for the gap.
And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows. Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR is asserted.
* In vfio_wait, I will maintain a list of waiters, each list element consisting of (task, mdev, close_flag=false).
* When the vfio device descriptor is closed, vfio_device_fops_release will notify the vfio_iommu driver, which will find the mdev on the waiters list, set elem->close_flag=true, and call wake_up_process for the task.
This alone is not sufficient, the mdev driver can continue to establish new mappings until its close_device function returns. Killing only existing mappings is racy.

I think you are focusing on the one issue I pointed at; as I said, I'm sure there are more ways than just close to abuse this functionality to deadlock the kernel.

I continue to prefer we remove it completely and do something more robust. I suggested two options.

Jason

On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
Do we have a solution to this?
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use fork.
qemu would do something like
.. stop all container ioctl activity ..
fork()
   ioctl(CHANGE_MM) // switch all maps to this mm
   .. signal parent..
   .. wait parent..
   exit(0)
.. wait child ..
exec()
ioctl(CHANGE_MM) // switch all maps to this mm
..signal child..
waitpid(childpid)
This way the kernel is never left without a page provider for the maps, the dummy mm_struct belonging to the fork will serve that role for the gap.
And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows. Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR is asserted.
* In vfio_wait, I will maintain a list of waiters, each list element consisting of (task, mdev, close_flag=false).
* When the vfio device descriptor is closed, vfio_device_fops_release will notify the vfio_iommu driver, which will find the mdev on the waiters list, set elem->close_flag=true, and call wake_up_process for the task.
This alone is not sufficient, the mdev driver can continue to establish new mappings until its close_device function returns. Killing only existing mappings is racy.
I think you are focusing on the one issue I pointed at, as I said, I'm sure there are more ways than just close to abuse this functionality to deadlock the kernel.
I continue to prefer we remove it completely and do something more robust. I suggested two options.
It's not racy. New pin requests also land in vfio_wait if any vaddr's have been invalidated in any vfio_dma in the iommu. See vfio_iommu_type1_pin_pages():

    if (iommu->vaddr_invalid_count)
        vfio_find_dma_valid()
            vfio_wait()

However, I will investigate saving a reference to the file object in the vfio_dma (for mappings backed by a file) and using that to translate IOVA's. I think that will be easier to use than fork/CHANGE_MM/exec, and may even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR. To be continued.

- Steve

On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
Do we have a solution to this?
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use fork.
qemu would do something like
.. stop all container ioctl activity ..
fork()
   ioctl(CHANGE_MM) // switch all maps to this mm
   .. signal parent..
   .. wait parent..
   exit(0)
.. wait child ..
exec()
ioctl(CHANGE_MM) // switch all maps to this mm
..signal child..
waitpid(childpid)
This way the kernel is never left without a page provider for the maps, the dummy mm_struct belonging to the fork will serve that role for the gap.
And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows. Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR is asserted.
* In vfio_wait, I will maintain a list of waiters, each list element consisting of (task, mdev, close_flag=false).
* When the vfio device descriptor is closed, vfio_device_fops_release will notify the vfio_iommu driver, which will find the mdev on the waiters list, set elem->close_flag=true, and call wake_up_process for the task.
This alone is not sufficient, the mdev driver can continue to establish new mappings until its close_device function returns. Killing only existing mappings is racy.
I think you are focusing on the one issue I pointed at, as I said, I'm sure there are more ways than just close to abuse this functionality to deadlock the kernel.
I continue to prefer we remove it completely and do something more robust. I suggested two options.
It's not racy. New pin requests also land in vfio_wait if any vaddr's have been invalidated in any vfio_dma in the iommu. See vfio_iommu_type1_pin_pages():

    if (iommu->vaddr_invalid_count)
        vfio_find_dma_valid()
        vfio_wait()
I mean you can't do a one shot wakeup of only existing waiters, and you can't corrupt the container to wake up waiters for other devices, so I don't see how this can be made to work safely...

It also doesn't solve any flow that doesn't trigger file close, like a process thread being stuck on the wait in the kernel. eg because a trapped mmio triggered an access or something.

So it doesn't seem like a workable direction to me.
However, I will investigate saving a reference to the file object in the vfio_dma (for mappings backed by a file) and using that to translate IOVA's.
It is certainly the best flow, but it may be difficult. Eg the memfd work for KVM to do something similar is quite involved.
I think that will be easier to use than fork/CHANGE_MM/exec, and may even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR. To be continued.
Yes, certainly easier to use, I suggested CHANGE_MM because the kernel implementation is very easy, I could send you something to test w/ iommufd in a few hours effort probably.

Anyhow, I think this conversation has convinced me there is no way to fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to it being a security bug, basically.

Jason
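The type1 gate Steve points to above can be modeled in miniature. This is a userspace model for illustration only, not kernel code; the field name mirrors the real iommu->vaddr_invalid_count, but the functions are made up:

```c
#include <stdbool.h>

/* Toy model: while any vaddr in the container has been invalidated by
 * VFIO_DMA_UNMAP_FLAG_VADDR, new pin requests wait (modeled here as a
 * boolean "would block") until a MAP restores a valid vaddr. */
struct iommu_model {
    int vaddr_invalid_count;   /* mirrors iommu->vaddr_invalid_count */
};

/* vfio_iommu_type1_pin_pages(): if (iommu->vaddr_invalid_count) the
 * caller ends up in vfio_find_dma_valid()/vfio_wait(). */
bool pin_would_block(const struct iommu_model *iommu)
{
    return iommu->vaddr_invalid_count > 0;
}

/* UNMAP with the VADDR flag invalidates a vaddr. */
void unmap_vaddr(struct iommu_model *iommu)
{
    iommu->vaddr_invalid_count++;
}

/* A later MAP restores the vaddr and lets waiters proceed. */
void remap_vaddr(struct iommu_model *iommu)
{
    iommu->vaddr_invalid_count--;
}
```

This shows why both sides can be right: new pinners do block at the gate, but anything that prevents the remap from ever happening (a stuck thread, a close that never completes) leaves every waiter stuck in the kernel, which is the deadlock Jason objects to.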

On 10/12/2022 8:32 AM, Jason Gunthorpe wrote:
On Tue, Oct 11, 2022 at 04:30:58PM -0400, Steven Sistare wrote:
On 10/11/2022 8:30 AM, Jason Gunthorpe wrote:
On Mon, Oct 10, 2022 at 04:54:50PM -0400, Steven Sistare wrote:
Do we have a solution to this?
If not I would like to make a patch removing VFIO_DMA_UNMAP_FLAG_VADDR
Aside from the approach to use the FD, another idea is to just use fork.
qemu would do something like
    .. stop all container ioctl activity ..
    fork()
        ioctl(CHANGE_MM)  // switch all maps to this mm
        .. signal parent ..
        .. wait parent ..
        exit(0)
    .. wait child ..
    exec()
    ioctl(CHANGE_MM)  // switch all maps to this mm
    .. signal child ..
    waitpid(childpid)
This way the kernel is never left without a page provider for the maps, the dummy mm_struct belonging to the fork will serve that role for the gap.
And the above is only required if we have mdevs, so we could imagine userspace optimizing it away for, eg vfio-pci only cases.
It is not as efficient as using a FD backing, but this is super easy to implement in the kernel.
I propose to avoid deadlock for mediated devices as follows. Currently, an mdev calling vfio_pin_pages blocks in vfio_wait while VFIO_DMA_UNMAP_FLAG_VADDR is asserted.
* In vfio_wait, I will maintain a list of waiters, each list element consisting of (task, mdev, close_flag=false).
* When the vfio device descriptor is closed, vfio_device_fops_release will notify the vfio_iommu driver, which will find the mdev on the waiters list, set elem->close_flag=true, and call wake_up_process for the task.
This alone is not sufficient, the mdev driver can continue to establish new mappings until its close_device function returns. Killing only existing mappings is racy.
I think you are focusing on the one issue I pointed at, as I said, I'm sure there are more ways than just close to abuse this functionality to deadlock the kernel.
I continue to prefer we remove it completely and do something more robust. I suggested two options.
It's not racy. New pin requests also land in vfio_wait if any vaddr's have been invalidated in any vfio_dma in the iommu. See vfio_iommu_type1_pin_pages():

    if (iommu->vaddr_invalid_count)
        vfio_find_dma_valid()
        vfio_wait()
I mean you can't do a one shot wakeup of only existing waiters, and you can't corrupt the container to wake up waiters for other devices, so I don't see how this can be made to work safely...
It also doesn't solve any flow that doesn't trigger file close, like a process thread being stuck on the wait in the kernel. eg because a trapped mmio triggered an access or something.
So it doesn't seem like a workable direction to me.
However, I will investigate saving a reference to the file object in the vfio_dma (for mappings backed by a file) and using that to translate IOVA's.
It is certainly the best flow, but it may be difficult. Eg the memfd work for KVM to do something similar is quite involved.
I think that will be easier to use than fork/CHANGE_MM/exec, and may even be easier to use than VFIO_DMA_UNMAP_FLAG_VADDR. To be continued.
Yes, certainly easier to use, I suggested CHANGE_MM because the kernel implementation is very easy, I could send you something to test w/ iommufd in a few hours effort probably.
Anyhow, I think this conversation has convinced me there is no way to fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to it being a security bug, basically.
Please do not. Please give me the courtesy of time to develop a replacement before we delete it. Surely you can make progress on other open areas of iommufd without needing to delete this immediately. - Steve

On Wed, Oct 12, 2022 at 09:50:53AM -0400, Steven Sistare wrote:
Anyhow, I think this conversation has convinced me there is no way to fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to it being a security bug, basically.
Please do not. Please give me the courtesy of time to develop a replacement before we delete it. Surely you can make progress on other open areas of iommufd without needing to delete this immediately.
I'm not worried about iommufd, I'm worried about shipping kernels with a significant security problem baked into them.

As we cannot salvage this interface it should be quickly deleted so that it doesn't cause any incidents.

It will not affect your ability to create a replacement.

Jason

On 10/12/2022 10:40 AM, Jason Gunthorpe wrote:
On Wed, Oct 12, 2022 at 09:50:53AM -0400, Steven Sistare wrote:
Anyhow, I think this conversation has convinced me there is no way to fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to it being a security bug, basically.
Please do not. Please give me the courtesy of time to develop a replacement before we delete it. Surely you can make progress on other open areas of iommufd without needing to delete this immediately.
I'm not worried about iommufd, I'm worried about shipping kernels with a significant security problem baked into them.
As we cannot salvage this interface it should be quickly deleted so that it doesn't cause any incidents.
It will not affect your ability to create a replacement.
I am not convinced we cannot salvage the interface, and indeed I might want to reuse parts of it, and you are over-stating the risk of a feature that is already in millions of kernels and has been for years. Deleting it all before having a replacement hurts the people like myself who are continuing to develop and test live update in qemu on the latest kernels. - Steve

On Wed, Oct 12, 2022 at 10:55:57AM -0400, Steven Sistare wrote:
On 10/12/2022 10:40 AM, Jason Gunthorpe wrote:
On Wed, Oct 12, 2022 at 09:50:53AM -0400, Steven Sistare wrote:
Anyhow, I think this conversation has convinced me there is no way to fix VFIO_DMA_UNMAP_FLAG_VADDR. I'll send a patch reverting it due to it being a security bug, basically.
Please do not. Please give me the courtesy of time to develop a replacement before we delete it. Surely you can make progress on other open areas of iommufd without needing to delete this immediately.
I'm not worried about iommufd, I'm worried about shipping kernels with a significant security problem baked into them.
As we cannot salvage this interface it should be quickly deleted so that it doesn't cause any incidents.
It will not affect your ability to create a replacement.
I am not convinced we cannot salvage the interface, and indeed I might want to reuse parts of it, and you are over-stating the risk of a feature that is already in millions of kernels and has been for years. Deleting it all before having a replacement hurts the people like myself who are continuing to develop and test live update in qemu on the latest kernels.
I think this is a mistake, as I'm convinced it cannot be salvaged, but we could instead guard it by a config symbol and CONFIG_EXPERIMENTAL. Jason

On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
If /dev/vfio/vfio is provided by iommufd it may well have to trigger a different ulimit tracking - if that is the only sticking point it seems minor and should be addressed in some later series that adds /dev/vfio/vfio support to iommufd..
And I have come up with a nice idea for this that feels OK:

 - Add a 'pin accounting compat' flag to struct iommufd_ctx (eg per FD)
 - The flag is set to 1 if /dev/vfio/vfio was the cdev that opened the ctx
 - An IOCTL issued by cap sysadmin can set the flag
 - If the flag is set we do not do pin accounting in the user. Instead we account for pins in the FD. The single FD cannot pass the rlimit.

This nicely emulates the desired behavior from virtualization without creating all the problems with exec/fork/etc that per-task tracking has. Even in iommufd native mode a privileged virtualization layer can use the ioctl to enter the old mode and pass the fd to qemu under a shared user. This should ease migration I guess.

It can still be oversubscribed but it is now limited to the number of iommufd_ctx's *with devices* that the userspace can create. Since each device can be attached to only 1 iommufd this is a stronger limit than the task limit. 1 device given to the qemu will mean a perfect enforcement. (ignoring that a hostile qemu can still blow past the rlimit using concurrent rdma or io_uring)

It is a small incremental step - does this suitably address the concern?

Jason
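The accounting split being proposed can be sketched as a small userspace model. All names here (user_acct, account_pin, the field names) are made up for illustration, not the real iommufd code; the model only shows the charging rule, not locking or unwinding:

```c
#include <stdbool.h>
#include <stddef.h>

/* Per-user charge, shared by every FD the user opens (the legacy mode). */
struct user_acct {
    size_t pinned;
};

/* Toy iommufd context: compat FDs charge themselves, native FDs charge
 * the user.  'vfio_compat' stands in for the proposed per-ctx flag. */
struct iommufd_ctx {
    bool vfio_compat;          /* set when opened via /dev/vfio/vfio */
    size_t fd_pinned;          /* per-FD charge in compat mode */
    struct user_acct *user;    /* per-user charge otherwise */
};

/* Charge npages against the rlimit; true if the pin is allowed. */
bool account_pin(struct iommufd_ctx *ctx, size_t npages, size_t rlimit)
{
    size_t *ctr = ctx->vfio_compat ? &ctx->fd_pinned : &ctx->user->pinned;

    if (*ctr + npages > rlimit)
        return false;          /* a single FD cannot pass the rlimit */
    *ctr += npages;
    return true;
}
```

The oversubscription caveat falls straight out of the model: two compat FDs each get a full rlimit's worth of pins, while two native FDs share the one user counter.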

On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
The malicious party on host VM hosts is generally the QEMU process. QEMU is normally prevented from spawning more tasks, both by SELinux controls and by the seccomp sandbox blocking clone() (except for thread creation). We need to constrain what any individual QEMU can do to the host, and the per-task mem locking limits can do that.

The mgmt app is what spawns more tasks (QEMU instances) and they are generally a trusted party on the host, or they are already constrained in other ways such as cgroups or namespaces. The mgmt apps would be expected to not intentionally overcommit the host with VMs needing too much cumulative locked RAM.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
The malicious party on host VM hosts is generally the QEMU process. QEMU is normally prevented from spawning more tasks, both by SELinux controls and by the seccomp sandbox blocking clone() (except for thread creation). We need to constrain what any individual QEMU can do to the host, and the per-task mem locking limits can do that.
Even with syscall limits simple things like execve (enabled eg for qemu self-upgrade) can corrupt the kernel task-based accounting to the point that the limits don't work.

Also, you are skipping the fact that since every subsystem does this differently and wrong a qemu can still go at least 3x over the allocation using just normal allowed functionalities.

Again, as a security feature this fundamentally does not work. We cannot account for a FD owned resource inside the task based mm_struct. There are always going to be exploitable holes.

What you really want is a cgroup based limit that is consistently applied in the kernel.

Regardless, since this seems pretty well entrenched I continue to suggest my simpler alternative of making it fd based instead of user based. At least that doesn't have the unsolvable bugs related to task accounting.

Jason

On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
The malicious party on host VM hosts is generally the QEMU process. QEMU is normally prevented from spawning more tasks, both by SELinux controls and by the seccomp sandbox blocking clone() (except for thread creation). We need to constrain what any individual QEMU can do to the host, and the per-task mem locking limits can do that.
Even with syscall limits simple things like execve (enabled eg for qemu self-upgrade) can corrupt the kernel task-based accounting to the point that the limits don't work.
Note, execve is currently blocked by default too by the default seccomp sandbox used with libvirt, as well as by the SELinux policy again. self-upgrade isn't a feature that exists (yet).

With regards,
Daniel

On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
The malicious party on host VM hosts is generally the QEMU process. QEMU is normally prevented from spawning more tasks, both by SELinux controls and by the seccomp sandbox blocking clone() (except for thread creation). We need to constrain what any individual QEMU can do to the host, and the per-task mem locking limits can do that.
Even with syscall limits simple things like execve (enabled eg for qemu self-upgrade) can corrupt the kernel task-based accounting to the point that the limits don't work.
Note, execve is currently blocked by default too by the default seccomp sandbox used with libvirt, as well as by the SELinux policy again. self-upgrade isn't a feature that exists (yet).
That userspace has disabled half the kernel isn't an excuse for the kernel to be insecure by design :( This needs to be fixed to enable features we know are coming so.. What would libvirt land like to see given task based tracking cannot be fixed in the kernel? Jason

On Thu, Sep 22, 2022 at 11:51:54AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
The malicious party on host VM hosts is generally the QEMU process. QEMU is normally prevented from spawning more tasks, both by SELinux controls and by the seccomp sandbox blocking clone() (except for thread creation). We need to constrain what any individual QEMU can do to the host, and the per-task mem locking limits can do that.
Even with syscall limits simple things like execve (enabled eg for qemu self-upgrade) can corrupt the kernel task-based accounting to the point that the limits don't work.
Note, execve is currently blocked by default too by the default seccomp sandbox used with libvirt, as well as by the SELinux policy again. self-upgrade isn't a feature that exists (yet).
That userspace has disabled half the kernel isn't an excuse for the kernel to be insecure by design :( This needs to be fixed to enable features we know are coming so..
What would libvirt land like to see given task based tracking cannot be fixed in the kernel?
There needs to be a mechanism to control individual VMs, whether by task or by cgroup. User based limits are not suited to what we need to achieve.

With regards,
Daniel

On Thu, Sep 22, 2022 at 04:00:00PM +0100, Daniel P. Berrangé wrote:
On Thu, Sep 22, 2022 at 11:51:54AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
The malicious party on host VM hosts is generally the QEMU process. QEMU is normally prevented from spawning more tasks, both by SELinux controls and by the seccomp sandbox blocking clone() (except for thread creation). We need to constrain what any individual QEMU can do to the host, and the per-task mem locking limits can do that.
Even with syscall limits simple things like execve (enabled eg for qemu self-upgrade) can corrupt the kernel task-based accounting to the point that the limits don't work.
Note, execve is currently blocked by default too by the default seccomp sandbox used with libvirt, as well as by the SELinux policy again. self-upgrade isn't a feature that exists (yet).
That userspace has disabled half the kernel isn't an excuse for the kernel to be insecure by design :( This needs to be fixed to enable features we know are coming so..
What would libvirt land like to see given task based tracking cannot be fixed in the kernel?
There needs to be a mechanism to control individual VMs, whether by task or by cgroup. User based limits are not suited to what we need to achieve.
The kernel has already standardized on user based limits here for other subsystems - libvirt and qemu cannot ignore that it exists. It is only a matter of time before qemu starts using these other subsystem features (eg io_uring) and has problems.

So, IMHO, the future must be that libvirt/etc sets an unlimited rlimit, because the user approach is not going away in the kernel and it sounds like libvirt cannot accommodate it at all.

This means we need to provide a new mechanism for future libvirt to use. Are you happy with cgroups?

Once those points are decided, we need to figure out how best to continue to support historical libvirt and still meet the kernel security needs going forward. This is where I'm thinking about storing the tracking in the FD instead of the user. IMHO task based is something that cannot be made to work properly.

Jason
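The "set an unlimited rlimit" half of this is ordinary setrlimit usage. One sketch of what a launcher could do before exec'ing qemu (illustrative; raising the soft limit up to the existing hard limit needs no privilege, while raising the hard limit itself would require CAP_SYS_RESOURCE):

```c
#include <sys/resource.h>

/* Raise the memlock soft limit to whatever hard limit we already have.
 * Returns 0 on success, -1 on failure.  A privileged launcher would
 * instead set both rlim_cur and rlim_max to RLIM_INFINITY and rely on
 * cgroup limits for the actual sandbox. */
int raise_memlock_soft_to_hard(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl))
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* soft := hard, always permitted */
    return setrlimit(RLIMIT_MEMLOCK, &rl);
}
```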

On Thu, Sep 22, 2022 at 12:31:20PM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 04:00:00PM +0100, Daniel P. Berrangé wrote:
On Thu, Sep 22, 2022 at 11:51:54AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack.
It is worth pointing out that preventing a DOS attack doesn't actually work because a *task* limit is trivially bypassed by just spawning more tasks. So, as a security feature, this is already very questionable.
The malicious party on host VM hosts is generally the QEMU process. QEMU is normally prevented from spawning more tasks, both by SELinux controls and by the seccomp sandbox blocking clone() (except for thread creation). We need to constrain what any individual QEMU can do to the host, and the per-task mem locking limits can do that.
Even with syscall limits simple things like execve (enabled eg for qemu self-upgrade) can corrupt the kernel task-based accounting to the point that the limits don't work.
Note, execve is currently blocked by default too by the default seccomp sandbox used with libvirt, as well as by the SELinux policy again. self-upgrade isn't a feature that exists (yet).
That userspace has disabled half the kernel isn't an excuse for the kernel to be insecure by design :( This needs to be fixed to enable features we know are coming so..
What would libvirt land like to see given task based tracking cannot be fixed in the kernel?
There needs to be a mechanism to control individual VMs, whether by task or by cgroup. User based limits are not suited to what we need to achieve.
The kernel has already standardized on user based limits here for other subsystems - libvirt and qemu cannot ignore that it exists. It is only a matter of time before qemu starts using these other subsystem features (eg io_uring) and has problems.
So, IMHO, the future must be that libvirt/etc sets an unlimited rlimit, because the user approach is not going away in the kernel and it sounds like libvirt cannot accommodate it at all.
This means we need to provide a new mechanism for future libvirt to use. Are you happy with cgroups?
Yes, we use cgroups extensively already.

With regards,
Daniel

On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
Yes, we use cgroups extensively already.
Ok, I will try to see about this.

Can you also tell me if the selinux/seccomp will prevent qemu from opening more than one /dev/vfio/vfio ? I suppose the answer is no?

Thanks,
Jason

On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
Yes, we use cgroups extensively already.
Ok, I will try to see about this
Can you also tell me if the selinux/seccomp will prevent qemu from opening more than one /dev/vfio/vfio ? I suppose the answer is no?
I don't believe there's any restriction on the number of open attempts, it's just a case of allowed or denied globally for the VM.

With regards,
Daniel

On Fri, Sep 23, 2022 at 02:35:20PM +0100, Daniel P. Berrangé wrote:
On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
Yes, we use cgroups extensively already.
Ok, I will try to see about this
Can you also tell me if the selinux/seccomp will prevent qemu from opening more than one /dev/vfio/vfio ? I suppose the answer is no?
I don't believe there's any restriction on the number of open attempts, it's just a case of allowed or denied globally for the VM.
Ok

For iommufd we plan to have qemu accept a single already opened FD of /dev/iommu and so the selinux/etc would block all access to the chardev.

Can you tell me if the thing invoking qemu that will open /dev/iommu will have CAP_SYS_RESOURCE ? I assume yes if it is already touching ulimits..

Jason

On Fri, Sep 23, 2022 at 10:46:21AM -0300, Jason Gunthorpe wrote:
On Fri, Sep 23, 2022 at 02:35:20PM +0100, Daniel P. Berrangé wrote:
On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
Yes, we use cgroups extensively already.
Ok, I will try to see about this
Can you also tell me if the selinux/seccomp will prevent qemu from opening more than one /dev/vfio/vfio ? I suppose the answer is no?
I don't believe there's any restriction on the nubmer of open attempts, its just a case of allowed or denied globally for the VM.
Ok
For iommufd we plan to have qemu accept a single already opened FD of /dev/iommu and so the selinux/etc would block all access to the chardev.
A selinux policy update would be needed to allow read()/write() for the inherited FD, while keeping open() blocked.
Can you tell me if the thing invoking qemu that will open /dev/iommu will have CAP_SYS_RESOURCE ? I assume yes if it is already touching ulimits..
The privileged libvirtd runs with privs equiv to root, so all capabilities are present. The unprivileged libvirtd runs with same privs as your user account, so no capabilities.

I vaguely recall there was some way to enable use of PCI passthrough for unpriv libvirtd, but needed a bunch of admin setup steps ahead of time.

With regards,
Daniel

On 9/23/22 10:00 AM, Daniel P. Berrangé wrote:
On Fri, Sep 23, 2022 at 10:46:21AM -0300, Jason Gunthorpe wrote:
On Fri, Sep 23, 2022 at 02:35:20PM +0100, Daniel P. Berrangé wrote:
On Fri, Sep 23, 2022 at 10:29:41AM -0300, Jason Gunthorpe wrote:
On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
Yes, we use cgroups extensively already.
Ok, I will try to see about this
Can you also tell me if the selinux/seccomp will prevent qemu from opening more than one /dev/vfio/vfio ? I suppose the answer is no?
I don't believe there's any restriction on the nubmer of open attempts, its just a case of allowed or denied globally for the VM.
Ok
For iommufd we plan to have qemu accept a single already opened FD of /dev/iommu and so the selinux/etc would block all access to the chardev.
A selinux policy update would be needed to allow read()/write() for the inherited FD, while keeping open() blocked.
Can you tell me if the thing invoking qemu that will open /dev/iommu will have CAP_SYS_RESOURCE ? I assume yes if it is already touching ulimits..
The privileged libvirtd runs with privs equiv to root, so all capabilities are present.
The unprivileged libvirtd runs with same privs as your user account, so no capabilities. I vaguely recall there was some way to enable use of PCI passthrough for unpriv libvirtd, but needed a bunch of admin setup steps ahead of time.
It's been a few years, but my recollection is that before starting a libvirtd that will run a guest with a vfio device, a privileged process needs to:

1) increase the locked memory limit for the user that will be running qemu (eg. by adding a file with the increased limit to /etc/security/limits.d)
2) bind the device to the vfio-pci driver, and
3) chown /dev/vfio/$iommu_group to the user running qemu.
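For step 1, the drop-in might look something like this (the file name and user name are illustrative, not anything libvirt ships):

```
# /etc/security/limits.d/90-vfio.conf (illustrative)
qemuuser  soft  memlock  unlimited
qemuuser  hard  memlock  unlimited
```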

On Fri, Sep 23, 2022 at 11:40:51AM -0400, Laine Stump wrote:
It's been a few years, but my recollection is that before starting a libvirtd that will run a guest with a vfio device, a privileged process needs to
1) increase the locked memory limit for the user that will be running qemu (eg. by adding a file with the increased limit to /etc/security/limits.d)
2) bind the device to the vfio-pci driver, and
3) chown /dev/vfio/$iommu_group to the user running qemu.
Here is what is going on to resolve this:

1) iommufd internally supports two ways to account ulimits, the vfio way and the io_uring way. Each FD operates in its own mode. When /dev/iommu is opened the FD defaults to the io_uring way, when /dev/vfio/vfio is opened it uses the VFIO way. This means /dev/vfio/vfio is not a symlink, there is a new kconfig now to make iommufd directly provide a miscdev.

2) There is an ioctl IOMMU_OPTION_RLIMIT_MODE which allows a privileged user to query/set which mode the FD will run in. The idea is that libvirt will open iommufd, the first action will be to set vfio compat mode, and then it will fd pass the fd to qemu and qemu will operate in the correct sandbox.

3) We are working on a cgroup for FOLL_LONGTERM, it is a big job but this should prove a comprehensive resolution to this problem across the kernel and improve the qemu sandbox security. Still TBD, but most likely when the cgroup supports this libvirt would set the rlimit to unlimited, then set new mlock and FOLL_LONGTERM cgroup limits to create the sandbox.

Jason
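The privilege model of step 2 can be modeled in a few lines. This is a userspace illustration only: the enum values, struct, and function are invented for the sketch and do not reflect the real IOMMU_OPTION_RLIMIT_MODE uAPI encoding:

```c
#include <errno.h>
#include <stdbool.h>

/* The two accounting modes from step 1 above. */
enum rlimit_mode {
    RLIMIT_MODE_USER,         /* io_uring-style, per-user (the default) */
    RLIMIT_MODE_VFIO_COMPAT,  /* vfio-style, per-FD */
};

struct iommufd_ctx {
    enum rlimit_mode mode;
};

/* Only a privileged caller may flip the mode.  In the described flow
 * libvirt (privileged) sets compat mode first, then fd-passes the
 * iommufd to the sandboxed qemu, which cannot change it back. */
int option_set_rlimit_mode(struct iommufd_ctx *ctx, enum rlimit_mode mode,
                           bool cap_sys_resource)
{
    if (!cap_sys_resource)
        return -EPERM;
    ctx->mode = mode;
    return 0;
}
```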

On Fri, 23 Sep 2022 10:29:41 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
Yes, we use cgroups extensively already.
Ok, I will try to see about this
Can you also tell me if the selinux/seccomp will prevent qemu from opening more than one /dev/vfio/vfio ? I suppose the answer is no?
QEMU manages the container:group association with legacy vfio, so it can't be restricted from creating multiple containers. Thanks, Alex

On Fri, Sep 23, 2022 at 08:03:07AM -0600, Alex Williamson wrote:
On Fri, 23 Sep 2022 10:29:41 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
On Fri, Sep 23, 2022 at 09:54:48AM +0100, Daniel P. Berrangé wrote:
Yes, we use cgroups extensively already.
Ok, I will try to see about this
Can you also tell me if the selinux/seccomp will prevent qemu from opening more than one /dev/vfio/vfio ? I suppose the answer is no?
QEMU manages the container:group association with legacy vfio, so it can't be restricted from creating multiple containers. Thanks,
.. and it absolutely will create multiple containers (i.e. open /dev/vfio/vfio multiple times) if there are multiple guest-side vIOMMU domains. It can, however, open each /dev/vfio/NN group file only once, since they are exclusive access.

-- 
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

On 9/21/22 2:06 PM, Alex Williamson wrote:
[Cc+ Steve, libvirt, Daniel, Laine]
On Tue, 20 Sep 2022 16:56:42 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
On Tue, Sep 13, 2022 at 09:28:18AM +0200, Eric Auger wrote:
Hi,
On 9/13/22 03:55, Tian, Kevin wrote:
We didn't close the open of how to get this merged in LPC due to the audio issue. Then let's use mails.
Overall there are three options on the table:
1) Require vfio-compat to be 100% compatible with vfio-type1
Probably not a good choice given the amount of work to fix the remaining gaps. And this will block support of new IOMMU features for a longer time.
2) Leave vfio-compat as what it is in this series
Treat it as a vehicle to validate the iommufd logic instead of immediately replacing vfio-type1. Functionally most vfio applications can work w/o change if putting aside the difference on locked mm accounting, p2p, etc.
Then work on new features and 100% vfio-type1 compat. in parallel.
3) Focus on iommufd native uAPI first
Require vfio_device cdev and adoption in Qemu. Only for new vfio app.
Then work on new features and vfio-compat in parallel.
I'm fine with either 2) or 3). Per a quick chat with Alex he prefers to 3).
I am also inclined to pursue 3) as this was Jason's initial guidance and a prerequisite to integrating new features. In the past we concluded vfio-compat would mostly be used for testing purposes. Our QEMU integration is fully based on the device-based API.
There are some poor chicken and egg problems here.
I had some assumptions:

a - the vfio cdev model is going to be iommufd only
b - any uAPI we add as we go along should be generally useful going forward
c - we should try to minimize the 'minimally viable iommufd' series
The compat as it stands now (eg #2) is threading this needle. Since it can exist without cdev it means (c) is made smaller, to two series.
Since we add something useful to some use cases, eg DPDK is deployable that way, (b) is OK.
If we focus on a strict path with 3, and avoid adding non-useful code, then we have to have two more (unwritten!) series beyond where we are now - vfio group compartmentalization, and cdev integration, and the initial (c) will increase.
3 also has us merging something that currently has no usable userspace, which I also dislike a lot.
I still think the compat gaps are small. I've realized that VFIO_DMA_UNMAP_FLAG_VADDR has no implementation in qemu, and since it can deadlock the kernel I propose we purge it completely.
Steve won't be happy to hear that; QEMU support exists but isn't yet merged.
P2P is ongoing.
That really just leaves the accounting, and I'm still not convinced that this must be a critical thing. Linus's latest remarks on tracepoints/BPF as ABI, reported in LWN from the maintainers summit, seem to support this. Let's see an actual deployed production configuration that would be impacted, and we won't find that unless we move forward.
I'll try to summarize the proposed change so that we can get better advice from libvirt folks, or potentially anyone else managing locked memory limits for device assignment VMs.
Background: when a DMA range, e.g. guest RAM, is mapped to a vfio device, we use the system IOMMU to provide GPA to HPA translation for assigned devices. Unlike CPU page tables, we don't generally have a means to demand-fault these translations, therefore the memory target of the translation is pinned so that it cannot be swapped or relocated, i.e. to guarantee the translation is always valid.
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack. Duplicate accounting should be resolved by iommufd, but is outside the scope of this discussion.
Currently, vfio tests against the mm_struct.locked_vm relative to rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur, where task is the current process. This is the same limit set via the setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.
Note that in both cases above, we're dealing with a task, or process limit and both prlimit and ulimit man pages describe them as such.
iommufd supposes instead, and references existing kernel implementations, that despite the descriptions above these limits are actually meant to be user limits and therefore instead charges pinned pages against user_struct.locked_vm and also marks them in mm_struct.pinned_vm.
The proposed algorithm is to read the _task_ locked memory limit, then attempt to charge the _user_ locked_vm, such that user_struct.locked_vm cannot exceed the task locked memory limit.
This obviously has implications. AFAICT, any management tool that doesn't instantiate assigned device VMs under separate users is essentially untenable. For example, if we launch VM1 under userA and set a locked memory limit of 4GB via prlimit to account for an assigned device, that works fine, until we launch VM2 from userA as well. In that case we can't simply set a 4GB limit on the VM2 task because there's already 4GB charged against user_struct.locked_vm for VM1. So we'd need to set the VM2 task limit to 8GB to be able to launch VM2. But not only that, we'd need to go back and also set VM1's task limit to 8GB or else it will fail if a DMA mapped memory region is transient and needs to be re-mapped.
Effectively any task under the same user and requiring pinned memory needs to have a locked memory limit set, and updated, to account for all tasks using pinned memory by that user.
How does this affect known current use cases of locked memory management for assigned device VMs?
Does qemu:///system by default sandbox into per-VM uids, or do they all use the qemu user by default?
Unless it is told otherwise in the XML for the VMs, each qemu process uses the same uid (which is usually "qemu", but can be changed in systemwide config).
I imagine qemu:///session mode is pretty screwed by this, but I also don't know who/where locked limits are lifted for such VMs. Boxes, which I think now supports assigned device VMs, could also be affected.
Because qemu:///session runs an unprivileged libvirt (i.e. unable to raise the limits), Boxes sets the limits elsewhere beforehand (not sure where, as I'm not familiar with the Boxes source).
So, I still like 2 because it yields the smallest next step before we can bring all the parallel work onto the list, and it makes testing and converting non-qemu stuff easier even going forward.
If a vfio compatible interface isn't transparently compatible, then I have a hard time understanding its value. Please correct my above description and implications, but I suspect these are not just theoretical ABI compat issues. Thanks,
Alex

On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
[Cc+ Steve, libvirt, Daniel, Laine]
On Tue, 20 Sep 2022 16:56:42 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
That really just leaves the accounting, and I'm still not convinced that this must be a critical thing. Linus's latest remarks on tracepoints/BPF as ABI, reported in LWN from the maintainers summit, seem to support this. Let's see an actual deployed production configuration that would be impacted, and we won't find that unless we move forward.
I'll try to summarize the proposed change so that we can get better advice from libvirt folks, or potentially anyone else managing locked memory limits for device assignment VMs.
Background: when a DMA range, e.g. guest RAM, is mapped to a vfio device, we use the system IOMMU to provide GPA to HPA translation for assigned devices. Unlike CPU page tables, we don't generally have a means to demand-fault these translations, therefore the memory target of the translation is pinned so that it cannot be swapped or relocated, i.e. to guarantee the translation is always valid.
The issue is where we account these pinned pages, where accounting is necessary such that a user cannot lock an arbitrary number of pages into RAM to generate a DoS attack. Duplicate accounting should be resolved by iommufd, but is outside the scope of this discussion.
Currently, vfio tests against the mm_struct.locked_vm relative to rlimit(RLIMIT_MEMLOCK), which reads task->signal->rlim[limit].rlim_cur, where task is the current process. This is the same limit set via the setrlimit syscall used by prlimit(1) and reported via 'ulimit -l'.
Note that in both cases above, we're dealing with a task, or process limit and both prlimit and ulimit man pages describe them as such.
iommufd supposes instead, and references existing kernel implementations, that despite the descriptions above these limits are actually meant to be user limits and therefore instead charges pinned pages against user_struct.locked_vm and also marks them in mm_struct.pinned_vm.
The proposed algorithm is to read the _task_ locked memory limit, then attempt to charge the _user_ locked_vm, such that user_struct.locked_vm cannot exceed the task locked memory limit.
This obviously has implications. AFAICT, any management tool that doesn't instantiate assigned device VMs under separate users is essentially untenable. For example, if we launch VM1 under userA and set a locked memory limit of 4GB via prlimit to account for an assigned device, that works fine, until we launch VM2 from userA as well. In that case we can't simply set a 4GB limit on the VM2 task because there's already 4GB charged against user_struct.locked_vm for VM1. So we'd need to set the VM2 task limit to 8GB to be able to launch VM2. But not only that, we'd need to go back and also set VM1's task limit to 8GB or else it will fail if a DMA mapped memory region is transient and needs to be re-mapped.
Effectively any task under the same user and requiring pinned memory needs to have a locked memory limit set, and updated, to account for all tasks using pinned memory by that user.
That is pretty unpleasant. Having to update all existing VMs when starting a new VM (or hotplugging a VFIO device into an existing VM) is something we would never want to do.

Charging this against the user weakens the DoS protection that we have today from the POV of individual VMs. Our primary risk is that a single QEMU is compromised and attempts to impact the host in some way. We want to limit the damage that an individual QEMU can cause. Consider 4 VMs each locked with 4 GB. Any single compromised VM is constrained to only use 4 GB of locked memory. With the per-user accounting, any single compromised VM can now use the cumulative 16 GB of locked memory, even though we only want that VM to be able to use 4 GB.

For a cumulative memory limit, we would expect cgroups to be the enforcement mechanism. e.g. consider a machine that has 64 GB of RAM, where we want to reserve 12 GB for host OS usage, with all other RAM available for VMs. A mgmt app wanting such protection would set a limit on /machines.slice at the 52 GB mark, to reserve the 12 GB for non-VM usage. Also, the mgmt app accounts for how many VMs it has started on a host and will not try to launch more VMs than there is RAM available to support them.

Accounting at the user level instead of the task level is effectively trying to protect against a bad mgmt app overcommitting the host. That is not really important, as the mgmt app is so privileged it is already assumed to be a trusted host component akin to a root account.

So per-user locked mem accounting looks like a regression in our VM isolation abilities compared to the per-task accounting.
How does this affect known current use cases of locked memory management for assigned device VMs?
It will affect every single application using libvirt today, with the possible exception of KubeVirt. KubeVirt puts each VM in a separate container, and if they have userns enabled, those will each get distinct UIDs for accounting purposes. I expect every other usage of libvirt to be affected, unless the mgmt app has gone out of its way to configure a dedicated UID for each QEMU; I'm not aware of any which do this.
Does qemu:///system by default sandbox into per-VM uids, or do they all use the qemu user by default? I imagine qemu:///session mode is pretty screwed by this, but I also don't know who/where locked limits are lifted for such VMs. Boxes, which I think now supports assigned device VMs, could also be affected.
In an out-of-the-box config, qemu:///system will always run all VMs under the user:group pairing qemu:qemu. Mgmt apps can request a dedicated UID per VM in the XML config, but I'm not aware of any doing this. Someone might, but we should assume the majority will not. IOW, it affects essentially all libvirt usage of VFIO.
So, I still like 2 because it yields the smallest next step before we can bring all the parallel work onto the list, and it makes testing and converting non-qemu stuff easier even going forward.
If a vfio compatible interface isn't transparently compatible, then I have a hard time understanding its value. Please correct my above description and implications, but I suspect these are not just theoretical ABI compat issues. Thanks,
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Sep 22, 2022 at 12:06:33PM +0100, Daniel P. Berrangé wrote:
So per-user locked mem accounting looks like a regression in our VM isolation abilities compared to the per-task accounting.
For this kind of API the management app needs to put each VM in its own user, which I'm a bit surprised it doesn't already do as a further protection against cross-process concerns. The question here is how do we provide enough compatibility for this existing methodology while still closing the security holes and inconsistencies that exist in the kernel implementation. Jason

On Thu, Sep 22, 2022 at 11:13:42AM -0300, Jason Gunthorpe wrote:
On Thu, Sep 22, 2022 at 12:06:33PM +0100, Daniel P. Berrangé wrote:
So per-user locked mem accounting looks like a regression in our VM isolation abilities compared to the per-task accounting.
For this kind of API the management app needs to put each VM in its own user, which I'm a bit surprised it doesn't already do as a further protection against cross-process concerns.
Putting VMs in dedicated users is not practical to do automatically on a general purpose OS install, because there's no arbitrator of what UID ranges can be safely used without conflicting with other usage on the OS.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (6):
- Alex Williamson
- Daniel P. Berrangé
- David Gibson
- Jason Gunthorpe
- Laine Stump
- Steven Sistare