On Wed, Aug 24, 2011 at 3:46 PM, Daniel P. Berrange
<berrange(a)redhat.com> wrote:
> On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
>> > On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange <berrange(a)redhat.com> wrote:
>> > On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
>> >> On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange
>> >> <berrange(a)redhat.com> wrote:
>> >> > I was at the KVM Forum / LinuxCon last week and there were many
>> >> > interesting things discussed which are relevant to ongoing libvirt
>> >> > development. Here was the list that caught my attention. If I have
>> >> > missed any, fill in the gaps....
>> >> >
>> >> > - Sandbox/container KVM. The Solaris port of KVM puts QEMU inside
>> >> >   a zone so that an exploit of QEMU can't escape into the full OS.
>> >> >   Containers are Linux's parallel of Zones, and while not nearly as
>> >> >   secure yet, it would still be worth using more containers support
>> >> >   to confine QEMU.
>> >>
>> >> Can you elaborate on why Linux containers are "not nearly as secure"
>> >> [as Solaris Zones]?
>> >
>> > Mostly because the Linux namespace functionality is far from complete,
>> > notably lacking proper UID/GID/capability separation, and UID/GID
>> > virtualization wrt filesystems. The longer answer is here:
>> >
>> >   https://wiki.ubuntu.com/UserNamespace
>> >
>> > So at this time you can't build a secure container on Linux, relying
>> > just on DAC alone. You have to add in a MAC layer on top of the container
>> > to get full security benefits, which obviously defeats the point of
>> > using the container as a backup for failure in the MAC layer.
>>
>> Thanks, that is interesting. I still don't understand why that is a
>> problem. Linux containers (lxc) use a different pid namespace (no
>> ptrace worries), a file system root restricted to a subdirectory tree,
>> forbid most device nodes, etc. Why does the user namespace matter
>> for security in this case?
>
> A number of reasons really...
>
> If user ID '0' on the host starts a container, and a process inside
> the container does 'setuid(500)', then any user outside the container
> with UID 500 will be able to kill that process. Only user ID '0' should
> have been allowed to do that.
>
> It will also let non-root user IDs on the host OS start containers
> and have root uid=0 inside the container.
>
> Finally, any files created inside the container with, say, uid 500
> will be accessible by any other process with UID 500, in either the
> host or any other container.
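
To make the first point concrete, here is a minimal, untested sketch (not
libvirt or lxc code): root starts a "container" with new PID and mount
namespaces but no user namespace, and the child drops to UID 500. The UID
value is arbitrary; the missing CLONE_NEWUSER is the whole point.

/* Run as root.  Because there is no user namespace, the child's UID 500
 * is the same UID 500 as on the host, so any host process running as
 * UID 500 can kill() it, even though only root started the container. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static int child(void *arg)
{
    (void)arg;
    /* We are PID 1 in the new PID namespace, but setuid() still maps
     * straight onto the host's UID 500 - no translation happens. */
    if (setuid(500) < 0) { perror("setuid"); return 1; }
    pause();                    /* wait until someone signals us */
    return 0;
}

int main(void)
{
    static char stack[1024 * 1024];

    /* Note: no CLONE_NEWUSER - exactly the gap being discussed. */
    pid_t pid = clone(child, stack + sizeof(stack),
                      CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); exit(1); }

    printf("container init is host PID %d, now running as UID 500\n", pid);
    /* From the host, any UID 500 user can do "kill -9 <that PID>" and it
     * succeeds, since the kernel only compares the raw UIDs. */
    waitpid(pid, NULL, 0);
    return 0;
}
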
These points mean that the host can peek inside containers and has
access to their processes/files. But from the point of view of a libvirt
running inside a container there is no security problem.

This is kind of like saying that root on the host can modify KVM guest
disk images. That is true, but I don't see it as a security problem
because root on the host is the trusted part of the system.

>> I think it matters when giving multiple containers access to the same
>> file system. Is that what you'd like to do for libvirt?
>
> Each container would have to share a (readonly) view onto the host
> filesystem so it can see the QEMU emulator install / libraries. There
> would also have to be some writable areas per QEMU container. QEMU
> inside the container would be set to run as some non-root UID (from
> the container's POV). So both problems 1 & 3 above would impact the
> security of this confinement.
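
Something like the following is what I picture for the per-container
filesystem view. It is only a rough sketch under assumptions of my own:
the paths and tmpfs size are invented, the target directories must already
exist, and a real setup would need many more mounts.

/* Rough sketch (invented paths): give one QEMU container a read-only view
 * of the host's emulator install plus a small private writable area.
 * Must run as root. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    const char *usr = "/srv/containers/guest1/usr";      /* invented path */
    const char *run = "/srv/containers/guest1/var/run";  /* invented path */

    /* Read-only view of /usr: bind mount it, then remount the bind
     * read-only (flags other than MS_BIND are ignored on the initial
     * bind mount, hence the second call). */
    if (mount("/usr", usr, NULL, MS_BIND, NULL) < 0)
        perror("bind /usr");
    if (mount(NULL, usr, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) < 0)
        perror("remount ro");

    /* Private writable area for this guest only (sockets, logs, ...). */
    if (mount("tmpfs", run, "tmpfs", MS_NOSUID | MS_NODEV,
              "size=16m,mode=0755") < 0)
        perror("tmpfs var/run");

    printf("guest1 root prepared\n");
    return 0;
}
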
But is there a way to escape confinement? If not, then this is secure.

Filesystem UID/GID ownership is the most likely way to escape the
confinement. You would have to be very careful to ensure that each
container's view of the filesystem did not include any directories
with files assigned to another container, since UID separation alone
would not prevent access to another container's resources. This is
rather tedious but probably just about doable; it gets harder, though,
when you throw in things like sysfs and PCI device assignment. E.g. a
guest with a PCI device assigned gets given ownership of the files in
/sys/bus/pci/devices/0000:00:XX:XX/ and, since there is no UID
namespacing, these will be accessible to any other container with the
same UID. To hack around this, when starting up a container you would
probably have to bind mount an empty tmpfs over the top of all the PCI
device paths you wanted to block in sysfs.
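
For that sysfs masking hack, I imagine something along these lines. This
is only a sketch of the idea: the PCI address is a made-up example, the
mode/size values are arbitrary, and in practice you would loop over every
device directory belonging to some other container, inside that
container's mount namespace.

/* Hide a sysfs PCI device directory by mounting a tiny, empty, read-only
 * tmpfs on top of it.  Run as root in the container's mount namespace. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mount.h>

static int mask_sysfs_path(const char *path)
{
    /* An empty read-only tmpfs hides whatever was underneath and also
     * stops the container creating anything there. */
    if (mount("tmpfs", path, "tmpfs",
              MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC,
              "size=4k,mode=0555") < 0) {
        perror(path);
        return -1;
    }
    return 0;
}

int main(void)
{
    /* One call per PCI device assigned to some *other* guest. */
    return mask_sysfs_path("/sys/bus/pci/devices/0000:00:19.0") ? 1 : 0;
}
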
Obviously you can get around this by running each guest as a different
user ID, but this is one of the things we wanted to avoid by using
containers, and it ought not to be needed if containers were actually
secure.
Daniel
--