Re: [PATCH RFC v2 00/13] IOMMUFD Generic interface

22 Sep 2022

On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
> On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> > > On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > > > The issue is where we account these pinned pages, where accounting is
> > > > > necessary such that a user cannot lock an arbitrary number of pages
> > > > > into RAM to generate a DoS attack.
> > > >
> > > > It is worth pointing out that preventing a DoS attack doesn't actually
> > > > work because a *task* limit is trivially bypassed by just spawning
> > > > more tasks. So, as a security feature, this is already very
> > > > questionable.
> > >
> > > The malicious party on VM hosts is generally the QEMU process.
> > > QEMU is normally prevented from spawning more tasks, both by SELinux
> > > controls and by the seccomp sandbox blocking clone() (except for
> > > thread creation).  We need to constrain what any individual QEMU can
> > > do to the host, and the per-task mem locking limits can do that.
> >
> > Even with syscall limits, simple things like execve (enabled e.g. for
> > qemu self-upgrade) can corrupt the kernel task-based accounting to the
> > point that the limits don't work.
>
> Note, execve is currently blocked by default too by the default
> seccomp sandbox used with libvirt, as well as by the SELinux
> policy again.  self-upgrade isn't a feature that exists (yet).

That userspace has disabled half the kernel isn't an excuse for the
kernel to be insecure by design :( This needs to be fixed to enable
features we know are coming, so...

What would libvirt land like to see, given that task-based tracking
cannot be fixed in the kernel?

Jason