On Thu, Jan 08, 2015 at 02:36:36PM +0100, Richard Weinberger wrote:
On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
> We have historically done a number of things with LXC that are
> somewhat questionable in retrospect:
>
> 1. Mounted /proc/sys read-only, but then mounted
> /proc/sys/net/ipv* read-write again
> 2. Mounted /sys read-only
> 3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
> 4. FUSE mount on /proc/meminfo
>
> Items 1 & 2 are pointless as they offer no security benefit either
> with or without user namespaces. Without userns it is always insecure,
> with userns it is always secure, no matter what the mount state is.
I agree. Thanks a lot for addressing this, Daniel!
> Item 3 is somewhat dubious, since /proc/self/cgroup paths for
> processes are now not visible at /sys/fs/cgroup. This really
> confuses systemd inside the container making it create a broken
> layout
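To make the item 3 mismatch concrete, here is a rough Python sketch that parses /proc/self/cgroup and reports which of the listed cgroup directories cannot be found under /sys/fs/cgroup. The helper names are hypothetical, and it assumes the common cgroup v1 convention of one mount per controller list at /sys/fs/cgroup/<controllers>, which not every setup follows.

```python
import os


def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (controllers, path) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        _hier_id, controllers, path = line.split(":", 2)
        entries.append((controllers, path))
    return entries


def expected_dir(controllers, path, root="/sys/fs/cgroup"):
    """Directory where the cgroup would be visible (assumed mount layout)."""
    # Assumption: each hierarchy is mounted at <root>/<controller-list>,
    # with any "name=" prefix dropped (name=systemd -> <root>/systemd).
    mount = controllers
    if mount.startswith("name="):
        mount = mount[len("name="):]
    return os.path.join(root, mount, path.lstrip("/"))


def missing_cgroup_dirs(text, root="/sys/fs/cgroup"):
    """Return the expected directories that do not exist under root."""
    dirs = [expected_dir(c, p, root) for c, p in parse_proc_cgroup(text)]
    return [d for d in dirs if not os.path.isdir(d)]


if __name__ == "__main__":
    with open("/proc/self/cgroup") as f:
        for d in missing_cgroup_dirs(f.read()):
            print("not visible:", d)
```

Run inside a guest where the guest's cgroup directory has been mounted over /sys/fs/cgroup/NNN, this would be expected to list the host-relative paths as not visible, which is the situation that confuses systemd.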
The question is, how to support systemd in containers?
As of now I'm not aware of a working concept.
With current libvirt it kind of works but recently I found a very nasty issue:
See:
https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html
That reply from Lennart suggests systemd should pretty much work,
albeit in a hacky way.
I've not done much in anger with systemd in containers, but I have
found it sufficient for application containers, i.e. not full OS
containers with interactive sessions.
Maybe it works with cgroup namespaces, i.e. such that systemd can mount
cgroupfs within the container in a secure way.
The current discussion can be found here:
https://lkml.org/lkml/2015/1/7/150
As of now I have to drop all my systemd lxc guests and will replace them with
a non-systemd distro, which is very sad. :-(
> Item 4 is somewhat dubious, since we're only changing some of the
> fields in /proc/meminfo. It helps apps which blindly parse
> /proc/meminfo to determine free system resources they can consume.
> Those apps are broken even without containers being involved though,
> since any application must expect to be placed inside a cgroup with
> limited resources. Faking /proc/meminfo is a pretty limited workaround
> that just delays the inevitable fixing of such apps.
You mean that tools like free(1) have to be patched to also query
memory limits from cgroupfs?
Not necessarily. The 'free' tool is said to
"Display amount of free and used memory in the system"
so it is arguably correct that it reports /proc/meminfo of the host
as a whole.
What is broken are applications that invoke 'free' and then
believe that the values it reports correspond to what the
application is able to use, i.e. the applications are not taking
into account that they might not be able to use all of the system's
resources due to cgroups or containers or both.
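As a rough illustration of what such an application could do instead, here is a minimal Python sketch that takes the smaller of the host's MemTotal and the cgroup memory limit. The function names are hypothetical, and the cgroup v1 path /sys/fs/cgroup/memory/memory.limit_in_bytes is an assumption about where the memory controller is mounted; a real application would have to locate its own cgroup first.

```python
def meminfo_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo content, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like "MemTotal:       16384256 kB".
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found in meminfo")


def usable_memory_bytes(meminfo_text, cgroup_limit_text=None):
    """Memory the application may use: min(host total, cgroup limit)."""
    total = meminfo_total_bytes(meminfo_text)
    if cgroup_limit_text is None:
        return total
    limit_text = cgroup_limit_text.strip()
    # cgroup v2 writes "max" when unlimited; cgroup v1 uses a very large
    # number instead, which min() below handles without special casing.
    if limit_text == "max":
        return total
    return min(total, int(limit_text))


def read_optional(path):
    """Return file content, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None


if __name__ == "__main__":
    # Assumed cgroup v1 location of the memory controller.
    limit = read_optional("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    meminfo = read_optional("/proc/meminfo")
    if meminfo is not None:
        print(usable_memory_bytes(meminfo, limit))
```

With this approach the application gets a sensible answer whether or not /proc/meminfo has been faked.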
> The patch that follows just removes the items 1 & 2, but I'm thinking
> we should go further and remove items 3 & 4 too.
>
> Changing 4 in particular is certainly classed as a guest ABI
> change though, so is not something distros may wish to see when
> upgrading libvirt. There is scope to argue that 1-3 are guest ABI
> changes too
>
> In full machine virt world, we deal with this using machine types.
> eg each new KVM version introduces a new machine type which models
> the guest ABI in a stable fashion. Guest machine types are fixed at
> time of first deployment. So when libvirt / KVM is upgraded, existing
> guests will not see any changes, but new guests will automatically
> get the new machine type.
>
> I'm thinking we might want to make use of this in LXC before making
> these changes. eg introduce a new machine 'libvirt-lxc-1' to
> represent the current guest mount setup and make sure all existing
> guests get that machine type. Then introduce a new machine type
> libvirt-lxc-2 that removes all this cruft, which new guests will
> get by default.
>
> Alternatively we could call them 'libvirt-lxc-compat-1' and
> 'libvirt-lxc-bare-1' to give a clearer indication of their
> functional difference and version them separately in the future ?
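The machine type idea could be sketched roughly as below. This is only an illustration in Python, not libvirt code: the machine names 'libvirt-lxc-1' and 'libvirt-lxc-2' come from the proposal above, while the feature flag names, the config dictionary, and the resolve_machine_type() helper are all hypothetical.

```python
# Guest-visible mount behaviours per machine type (flag names are invented).
MACHINE_TYPES = {
    # Current behaviour: items 1-4 from the list above are all applied.
    "libvirt-lxc-1": {
        "proc_sys_readonly": True,
        "sys_readonly": True,
        "cgroup_submount": True,
        "fuse_meminfo": True,
    },
    # Proposed behaviour: none of them are applied.
    "libvirt-lxc-2": {
        "proc_sys_readonly": False,
        "sys_readonly": False,
        "cgroup_submount": False,
        "fuse_meminfo": False,
    },
}

LEGACY_MACHINE = "libvirt-lxc-1"
DEFAULT_NEW_MACHINE = "libvirt-lxc-2"


def resolve_machine_type(config, is_new_guest):
    """Pick the machine type for a guest config (a plain dict here).

    An explicit machine type is always honoured. Otherwise existing
    guests are pinned to the legacy type and new guests get the default.
    """
    machine = config.get("machine")
    if machine is None:
        machine = DEFAULT_NEW_MACHINE if is_new_guest else LEGACY_MACHINE
    if machine not in MACHINE_TYPES:
        raise ValueError("unknown machine type: %s" % machine)
    return machine


def guest_features(config, is_new_guest):
    """Return the mount feature flags the guest should see."""
    return MACHINE_TYPES[resolve_machine_type(config, is_new_guest)]
```

The point of the sketch is that the type is recorded once, at first deployment, so upgrading libvirt changes only the default for new guests and never the flags an existing guest sees.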
Can we have a new machine type which enforces user namespaces?
Hmm, I'm not sure that would work. Not least because we need a way to
assume the UID/GID mapping, and the filesystems used with the container
need to have the right UID/GID permissions set up. IOW I don't think
user ns is something we can transparently / automatically turn on.
Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|