Quoting Daniel P. Berrange (berrange(a)redhat.com):
On Thu, Aug 25, 2011 at 08:58:27AM -0500, Serge E. Hallyn wrote:
> Quoting Stefan Hajnoczi (stefanha(a)gmail.com):
> > On Thu, Aug 25, 2011 at 11:03 AM, Daniel P. Berrange
> > <berrange(a)redhat.com> wrote:
> > > On Thu, Aug 25, 2011 at 10:10:27AM +0100, Stefan Hajnoczi wrote:
> > >> On Wed, Aug 24, 2011 at 3:46 PM, Daniel P. Berrange
> > >> <berrange(a)redhat.com> wrote:
> > >> > On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
> > >> >> On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange
> > >> >> <berrange(a)redhat.com> wrote:
> > >> >> > On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
> > >> >> >> On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange
> > >> >> >> <berrange(a)redhat.com> wrote:
> > >> >> >> > I was at the KVM Forum / LinuxCon last week and there were many
> > >> >> >> > interesting things discussed which are relevant to ongoing libvirt
> > >> >> >> > development. Here was the list that caught my attention. If I have
> > >> >> >> > missed any, fill in the gaps....
> > >> >> >> >
> > >> >> >> > - Sandbox/container KVM. The Solaris port of KVM puts QEMU inside
> > >> >> >> >   a zone so that an exploit of QEMU can't escape into the full OS.
> > >> >> >> >   Containers are Linux's parallel of Zones, and while not nearly as
> > >> >> >> >   secure yet, it would still be worth using more containers support
> > >> >> >> >   to confine QEMU.
> > >> >> >>
> > >> >> >> Can you elaborate on why Linux containers are "not nearly as secure"
> > >> >> >> [as Solaris Zones]?
> > >> >> >
> > >> >> > Mostly because the Linux namespace functionality is far from complete,
> > >> >> > notably lacking proper UID/GID/capability separation, and UID/GID
> > >> >> > virtualization wrt filesystems. The longer answer is here:
> > >> >> >
> > >> >> >   https://wiki.ubuntu.com/UserNamespace
> > >> >> >
> > >> >> > So at this time you can't build a secure container on Linux, relying
> > >> >> > on DAC alone. You have to add in a MAC layer on top of the container
> > >> >> > to get full security benefits, which obviously defeats the point of
> > >> >> > using the container as a backup for failure in the MAC layer.
> > >> >>
> > >> >> Thanks, that is interesting. I still don't understand why that is a
> > >> >> problem. Linux containers (lxc) use a different pid namespace (no
> > >> >> ptrace worries), file system root (restricted to a subdirectory tree),
> > >> >> forbid most device nodes, etc. Why does the user namespace matter
> > >> >> for security in this case?
> > >> >
> > >> > A number of reasons really...
> > >> >
> > >> > If user ID '0' on the host starts a container, and a process inside
> > >> > the container does 'setuid(500)', then any user outside the container
> > >> > with UID 500 will be able to kill that process. Only user ID '0' should
> > >> > have been allowed to do that.
> > >> >
> > >> > It will also let non-root user IDs on the host OS start containers
> > >> > and have root uid=0 inside the container.
> > >> >
> > >> > Finally, any files created inside the container with, say, uid 500
> > >> > will be accessible by any other process with UID 500, in either the
> > >> > host or any other container.
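The first and third points above can be illustrated with a small model. This is only a sketch, assuming a much-simplified version of the kernel's checks (UID match or root; real kill(2) and file permission checks also involve saved set-UIDs, capabilities, and mode bits). The names Process, File, may_signal and owner_access are hypothetical, made up for this illustration.

```python
from dataclasses import dataclass

ROOT_UID = 0


@dataclass(frozen=True)
class Process:
    name: str
    uid: int        # without user namespaces, the same number everywhere
    location: str   # "host" or a container label; the checks below ignore it


@dataclass(frozen=True)
class File:
    path: str
    owner_uid: int


def may_signal(sender: Process, target: Process) -> bool:
    """Simplified signal permission check: root, or matching UID."""
    return sender.uid == ROOT_UID or sender.uid == target.uid


def owner_access(proc: Process, f: File) -> bool:
    """Simplified owner check: root, or the file's owning UID."""
    return proc.uid == ROOT_UID or proc.uid == f.owner_uid


# A process that did setuid(500) inside a container started by host root.
qemu = Process("qemu", 500, "container-a")
# Unrelated processes that merely share (or do not share) the number 500.
host_user = Process("shell", 500, "host")
other_container = Process("qemu", 500, "container-b")
host_other = Process("shell", 501, "host")

# Hypothetical disk image path, created inside container-a as uid 500.
disk = File("/var/lib/guest-a/disk.img", 500)

print(may_signal(host_user, qemu))          # UID collision lets the host user kill it
print(owner_access(other_container, disk))  # and another container can reach the file
```

Nothing in either check consults the location field, which is the point: with no UID namespacing, the container boundary plays no part in the decision.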
> > >>
> > >> These points mean that the host can peek inside containers and has
> > >> access to their processes/files. But from the point of view of a
> > >> libvirt running inside a container there is no security problem.
> > >>
> > >> This is kind of like saying that root on the host can modify KVM guest
> > >> disk images. That is true but I don't see it as a security problem
> > >> because the root on the host is the trusted part of the system.
> > >>
> > >> >> I think it matters when giving multiple containers access to the same
> > >> >> file system. Is that what you'd like to do for libvirt?
> > >> >
> > >> > Each container would have to share a (readonly) view onto the host
> > >> > filesystem so it can see the QEMU emulator install / libraries. There
> > >> > would also have to be some writable areas per QEMU container. QEMU
> > >> > inside the container would be set to run as some non-root UID (from
> > >> > the container's POV). So both problems 1 & 3 above would impact the
> > >> > security of this confinement.
> > >>
> > >> But is there a way to escape confinement? If not, then this is secure.
> > >
> > > The filesystem UID/GID ownership is the most likely way you can escape
> > > the confinement. You would have to be very careful to ensure that each
> > > container's view of the filesystem did not include any directories
> > > with files that are assigned to another container, since the UID
> > > separation would not prevent access to another container's resources.
> > >
> > > This is rather tedious but could be just about doable, but it gets
> > > harder when you throw in things like sysfs and PCI device assignment.
> > > eg a guest with PCI device assigned gets given ownership of the files
> > > in /sys/bus/pci/devices/0000:00:XX:XX/ and since there is no UID
> > > namespacing, this will be accessible to any other container with the
> > > same UID. To hack around this when starting up a container you would
> > > probably have to bind mount an empty tmpfs over the top of all the
> > > PCI device paths you wanted to block in sysfs.
>
> Which of course is easily undoable by root in the container :)
Yep, you'd have to make sure QEMU was non-root for it to be at all
practical.
> > Ah, I hadn't thought of /sys/bus/pci or /sys/bus/usb!
> >
> > Thanks for the explanation and it does seem like the design would get messy.
>
> And plenty more, e.g. http://blog.bofh.it/debian/id_413
Cool, a nice demo :-)
> See http://sourceforge.net/mailarchive/message.php?msg_id=27878921 for
> someone actively using Smack to help mitigate this (which could also be
> done with SELinux).
Yes, I've got the same done with SELinux, but haven't posted it for
review yet, since it needs more testing and some policy additions:

  https://gitorious.org/~berrange/libvirt/staging/commits/lxc-svirt

Of course in the context of this discussion, QEMU already runs under
SELinux, and my desire for containers was to act as a safety net for
when SELinux fails for some reason (or is disabled by an admin) so
back to square one wrt security :-)
You also might consider seccomp2, WHEN it lands :) I trust that once
qemu is running, it doesn't need too baroque a set of system calls.
-serge