On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > The issue is where we account these pinned pages, where accounting is
> > > necessary such that a user cannot lock an arbitrary number of pages
> > > into RAM to generate a DoS attack.
> >
> > It is worth pointing out that preventing a DoS attack doesn't actually
> > work because a *task* limit is trivially bypassed by just spawning
> > more tasks. So, as a security feature, this is already very
> > questionable.
>
> The malicious party on VM hosts is generally the QEMU process.
> QEMU is normally prevented from spawning more tasks, both by SELinux
> controls and by the seccomp sandbox blocking clone() (except for
> thread creation). We need to constrain what any individual QEMU can
> do to the host, and the per-task mem locking limits can do that.
Even with syscall limits, simple things like execve (enabled e.g. for
QEMU self-upgrade) can corrupt the kernel's task-based accounting to the
point that the limits don't work.

Also, you are skipping the fact that since every subsystem does this
differently, and wrongly, a QEMU can still go at least 3x over the
allocation using just normal, allowed functionality.
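
To illustrate the 3x point, here is a toy model, not kernel code. The
subsystem names, the limit value and the Task class are all
hypothetical. The only thing it assumes from the argument above is that
each subsystem checks the limit against its own private counter rather
than a shared one.

```python
# Toy model only: names and numbers are hypothetical, not kernel APIs.
from dataclasses import dataclass, field

RLIMIT_MEMLOCK_PAGES = 1000  # hypothetical per-task limit, in pages


@dataclass
class Task:
    # One private counter per subsystem; nothing sums them up.
    counters: dict = field(default_factory=dict)

    def pin(self, subsystem: str, pages: int) -> bool:
        """Charge pages to one subsystem's counter.

        The check looks only at that subsystem's own counter, which is
        the assumed flaw being illustrated.
        """
        used = self.counters.get(subsystem, 0)
        if used + pages > RLIMIT_MEMLOCK_PAGES:
            return False
        self.counters[subsystem] = used + pages
        return True

    def total_pinned(self) -> int:
        return sum(self.counters.values())


def demo() -> int:
    task = Task()
    for subsystem in ("subsys_a", "subsys_b", "subsys_c"):
        # Each subsystem happily accepts a full limit's worth of pages...
        assert task.pin(subsystem, RLIMIT_MEMLOCK_PAGES)
        # ...and only refuses once its own counter is exhausted.
        assert not task.pin(subsystem, 1)
    return task.total_pinned()


if __name__ == "__main__":
    print(demo())  # 3000, i.e. 3x the per-task limit
```

Every individual check passes, yet the task ends up holding three times
the configured limit, because no check ever sees the sum.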

Again, as a security feature this fundamentally does not work. We
cannot account for an FD-owned resource inside the task-based
mm_struct. There are always going to be exploitable holes.

What you really want is a cgroup-based limit that is consistently
applied in the kernel.

Regardless, since this seems pretty well entrenched, I continue to
suggest my simpler alternative of making it fd-based instead of
user-based. At least that doesn't have the unsolvable bugs related to
task accounting.

Jason