On Thu, Aug 25, 2011 at 08:58:27AM -0500, Serge E. Hallyn wrote:
Quoting Stefan Hajnoczi (stefanha(a)gmail.com):
> On Thu, Aug 25, 2011 at 11:03 AM, Daniel P. Berrange
> <berrange(a)redhat.com> wrote:
> > On Thu, Aug 25, 2011 at 10:10:27AM +0100, Stefan Hajnoczi wrote:
> >> On Wed, Aug 24, 2011 at 3:46 PM, Daniel P. Berrange
> >> <berrange(a)redhat.com> wrote:
> >> > On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
> >> >> On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange
> >> >> <berrange(a)redhat.com> wrote:
> >> >> > On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
> >> >> >> On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange
> >> >> >> <berrange(a)redhat.com> wrote:
> >> >> >> > I was at the KVM Forum / LinuxCon last week and there were many
> >> >> >> > interesting things discussed which are relevant to ongoing libvirt
> >> >> >> > development. Here was the list that caught my attention. If I have
> >> >> >> > missed any, fill in the gaps....
> >> >> >> >
> >> >> >> > - Sandbox/container KVM. The Solaris port of KVM puts QEMU inside
> >> >> >> > a zone so that an exploit of QEMU can't escape into the full OS.
> >> >> >> > Containers are Linux's parallel of Zones, and while not nearly as
> >> >> >> > secure yet, it would still be worth using more containers support
> >> >> >> > to confine QEMU.
> >> >> >>
> >> >> >> Can you elaborate on why Linux containers are "not nearly as secure"
> >> >> >> [as Solaris Zones]?
> >> >> >
> >> >> > Mostly because the Linux namespace functionality is far from complete,
> >> >> > notably lacking proper UID/GID/capability separation, and UID/GID
> >> >> > virtualization wrt filesystems. The longer answer is here:
> >> >> >
> >> >> > https://wiki.ubuntu.com/UserNamespace
> >> >> >
> >> >> > So at this time you can't build a secure container on Linux relying
> >> >> > on DAC alone. You have to add a MAC layer on top of the container
> >> >> > to get full security benefits, which obviously defeats the point of
> >> >> > using the container as a backup for failure in the MAC layer.
> >> >>
> >> >> Thanks, that is interesting. I still don't understand why that is a
> >> >> problem. Linux containers (lxc) use a different pid namespace (no
> >> >> ptrace worries), file system root (restricted to a subdirectory tree),
> >> >> forbid most device nodes, etc. Why does the user namespace matter
> >> >> for security in this case?
> >> >
> >> > A number of reasons really...
> >> >
> >> > If user ID '0' on the host starts a container, and a process inside
> >> > the container does 'setuid(500)', then any user outside the container
> >> > with UID 500 will be able to kill that process. Only user ID '0' should
> >> > have been allowed to do that.
> >> >
> >> > It will also let non-root user IDs on the host OS, start containers
> >> > and have root uid=0 inside the container.
> >> >
> >> > Finally, any files created inside the container with, say, uid 500
> >> > will be accessible by any other process with UID 500, in either the
> >> > host or any other container
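
The first and third points can be illustrated with a small model. This is
only a sketch under stated assumptions, not kernel code: the Proc class and
both check functions are hypothetical, and the real kill(2) rule also looks
at real/effective/saved UIDs and CAP_KILL. What it shows is that, without
user namespaces, only the numeric UID is compared and the container a
process lives in plays no part in the decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Proc:
    """Hypothetical stand-in for a process: a name, the container it
    runs in (None for the host), and its numeric UID."""
    name: str
    container: Optional[str]
    uid: int


def may_signal(sender: Proc, target: Proc) -> bool:
    """Simplified signal permission check without user namespaces:
    root may signal anything, otherwise the numeric UIDs must match.
    The container field is deliberately never consulted."""
    return sender.uid == 0 or sender.uid == target.uid


def may_access_file(proc: Proc, file_owner_uid: int) -> bool:
    """Simplified owner-only DAC check (think mode 0600): again only
    the numeric UID is compared."""
    return proc.uid == 0 or proc.uid == file_owner_uid


# A QEMU process that did setuid(500) inside container "guest1".
qemu = Proc("qemu", "guest1", 500)
# An unrelated UID 500 user on the host, and one in another container.
host_user = Proc("shell", None, 500)
other_guest = Proc("qemu", "guest2", 500)

print(may_signal(host_user, qemu))            # True: host UID 500 can kill it
print(may_access_file(other_guest, qemu.uid)) # True: files are shared by UID
```

With UID virtualization, the comparison would instead involve a
(namespace, uid) pair, so UID 500 in one container would not match UID 500
anywhere else.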
> >>
> >> These points mean that the host can peek inside containers and has
> >> access to their processes/files. But from the point of view of a libvirt
> >> running inside a container there is no security problem.
> >>
> >> This is kind of like saying that root on the host can modify KVM guest
> >> disk images. That is true but I don't see it as a security problem
> >> because the root on the host is the trusted part of the system.
> >>
> >> >> I think it matters when giving multiple containers access to the same
> >> >> file system. Is that what you'd like to do for libvirt?
> >> >
> >> > Each container would have to share a (readonly) view onto the host
> >> > filesystem so it can see the QEMU emulator install / libraries. There
> >> > would also have to be some writable areas per QEMU container. QEMU
> >> > inside the container would be set to run as some non-root UID (from
> >> > the container's POV). So both problems 1 & 3 above would impact the
> >> > security of this confinement.
> >>
> >> But is there a way to escape confinement? If not, then this is secure.
> >
> > The filesystem UID/GID ownership is the most likely way you can escape
> > the confinement. You would have to be very careful to ensure that each
> > container's view of the filesystem did not include any directories
> > with files that are assigned to another container, since the UID
> > separation would not prevent access to another container's resources.
> >
> > This is rather tedious but could be just about doable, but it gets
> > harder when you throw in things like sysfs and PCI device assignment.
> > eg a guest with PCI device assigned gets given ownership of the files
> > in /sys/bus/pci/devices/0000:00:XX:XX/ and since there is no UID
> > namespacing, this will be accessible to any other container with the
> > same UID. To hack around this when starting up a container you would
> > probably have to bind mount an empty tmpfs over the top of all the
> > PCI device paths you wanted to block in sysfs.
Which of course is easily undoable by root in the container :)

Yep, you'd have to make sure QEMU was non-root for it to be at all
practical.
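
For illustration, the masking step discussed above could be sketched as a
helper that only builds the mount commands and does not run them. The
function name, the read-only option and the PCI addresses are all
hypothetical. It uses a plain tmpfs mount over each path rather than a
literal bind mount, which is an assumption about what was meant.

```python
def tmpfs_mask_commands(pci_addresses, sysfs_root="/sys/bus/pci/devices"):
    """Return one mount(8) argv list per PCI device directory that
    should be hidden from the container by an empty tmpfs.

    Nothing is executed here; the caller would run these as root
    while setting up the container's mount namespace.
    """
    commands = []
    for addr in pci_addresses:
        path = f"{sysfs_root}/{addr}"
        # An empty, read-only tmpfs mounted on top hides the real
        # sysfs attribute files underneath it.
        commands.append(["mount", "-t", "tmpfs", "-o", "ro", "tmpfs", path])
    return commands


# Hypothetical addresses of devices assigned to *other* guests.
blocked = ["0000:00:19.0", "0000:00:1a.0"]
for argv in tmpfs_mask_commands(blocked):
    print(" ".join(argv))
```

As noted, a process running as root inside the container could simply
unmount these again, so this only helps if QEMU runs as a non-root user.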
> Ah, I hadn't thought of /sys/bus/pci or /sys/bus/usb!
>
> Thanks for the explanation and it does seem like the design would get messy.
And plenty more, e.g. http://blog.bofh.it/debian/id_413

See http://sourceforge.net/mailarchive/message.php?msg_id=27878921
for someone actively using Smack to help mitigate this (which could also
be done with SELinux).
Yes, I've got the same done with SELinux, but haven't posted it for
review yet, since it needs more testing and some policy additions.

Of course in the context of this discussion, QEMU already runs under
SELinux, and my desire for containers was to act as a safety net for
when SELinux fails for some reason (or is disabled by an admin) so
back to square one wrt security :-)
Daniel