On Thu, Sep 22, 2022 at 12:31:20PM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 22, 2022 at 04:00:00PM +0100, Daniel P. Berrangé wrote:
> > On Thu, Sep 22, 2022 at 11:51:54AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 22, 2022 at 03:49:02PM +0100, Daniel P. Berrangé wrote:
> > > > On Thu, Sep 22, 2022 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 22, 2022 at 12:20:50PM +0100, Daniel P. Berrangé wrote:
> > > > > > On Wed, Sep 21, 2022 at 03:44:24PM -0300, Jason Gunthorpe wrote:
> > > > > > > On Wed, Sep 21, 2022 at 12:06:49PM -0600, Alex Williamson wrote:
> > > > > > > > The issue is where we account these pinned pages, where accounting is
> > > > > > > > necessary such that a user cannot lock an arbitrary number of pages
> > > > > > > > into RAM to generate a DoS attack.
> > > > > > >
> > > > > > > It is worth pointing out that preventing a DOS attack doesn't actually
> > > > > > > work because a *task* limit is trivially bypassed by just spawning
> > > > > > > more tasks. So, as a security feature, this is already very
> > > > > > > questionable.
> > > > > >
> > > > > > The malicious party on host VM hosts is generally the QEMU process.
> > > > > > QEMU is normally prevented from spawning more tasks, both by SELinux
> > > > > > controls and by the seccomp sandbox blocking clone() (except for
> > > > > > thread creation). We need to constrain what any individual QEMU can
> > > > > > do to the host, and the per-task mem locking limits can do that.
> > > > >
> > > > > Even with syscall limits simple things like execve (enabled eg for
> > > > > qemu self-upgrade) can corrupt the kernel task-based accounting to the
> > > > > point that the limits don't work.
> > > >
> > > > Note, execve is currently blocked by default too by the default
> > > > seccomp sandbox used with libvirt, as well as by the SELinux
> > > > policy again. self-upgrade isn't a feature that exists (yet).
> > >
> > > That userspace has disabled half the kernel isn't an excuse for the
> > > kernel to be insecure by design :( This needs to be fixed to enable
> > > features we know are coming so..
> > >
> > > What would libvirt land like to see given task based tracking cannot
> > > be fixed in the kernel?
> >
> > There needs to be a mechanism to control individual VMs, whether by
> > task or by cgroup. User based limits are not suited to what we need
> > to achieve.
>
> The kernel has already standardized on user based limits here for
> other subsystems - libvirt and qemu cannot ignore that it exists. It
> is only a matter of time before qemu starts using these other
> subsystem features (eg io_uring) and has problems.
>
> So, IMHO, the future must be that libvirt/etc sets an unlimited
> rlimit, because the user approach is not going away in the kernel and
> it sounds like libvirt cannot accommodate it at all.
>
> This means we need to provide a new mechanism for future libvirt to
> use. Are you happy with cgroups?

Yes, we use cgroups extensively already.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|