On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
We have historically done a number of things with LXC that are
somewhat questionable in retrospect:
1. Mounted /proc/sys read-only, but then mounted
/proc/sys/net/ipv* read-write again
2. Mounted /sys read only
3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
4. FUSE mount on /proc/meminfo
Items 1 & 2 are pointless as they offer no security benefit either
with or without user namespaces. Without userns it is always insecure,
with userns it is always secure, no matter what the mount state is.
I agree. Thanks a lot for addressing this, Daniel!
Item 3 is somewhat dubious, since /proc/self/cgroup paths for
processes are now not visible at /sys/fs/cgroup. This really
confuses systemd inside the container, making it create a broken
layout.
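
To make the mismatch in item 3 concrete, here is a minimal sketch (not
libvirt or systemd code) that parses /proc/self/cgroup and reports which
of the listed paths have no matching directory under /sys/fs/cgroup. The
function names are hypothetical, and the assumption that each cgroup v1
hierarchy is mounted at /sys/fs/cgroup/<controllers> (with "name=systemd"
mounted at /sys/fs/cgroup/systemd) reflects a common layout, not a
guarantee.

```python
import os
import posixpath


def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Format of each line: hierarchy-ID:controller-list:cgroup-path
        hier_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hier_id), names, path))
    return entries


def missing_cgroup_dirs(text, mount_root="/sys/fs/cgroup", exists=os.path.isdir):
    """Return the expected cgroup directories that do not exist.

    Assumption: hierarchies are mounted under mount_root by controller
    list, and named hierarchies drop their "name=" prefix.
    """
    missing = []
    for _hier_id, controllers, path in parse_proc_cgroup(text):
        if not controllers:
            continue
        mount_name = ",".join(c[5:] if c.startswith("name=") else c
                              for c in controllers)
        candidate = posixpath.normpath(
            posixpath.join(mount_root, mount_name, path.lstrip("/")))
        if not exists(candidate):
            missing.append(candidate)
    return missing


if __name__ == "__main__":
    with open("/proc/self/cgroup") as fh:
        for path in missing_cgroup_dirs(fh.read()):
            print("not visible:", path)
```

Inside a guest set up as in item 3, one would expect this to list the
host-side paths, since only the guest's own directory is bind-mounted
over the hierarchy root.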
The question is, how to support systemd in containers?
As of now I'm not aware of a working concept.
With current libvirt it kind of works but recently I found a very nasty issue:
See:
https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html
Maybe it works with cgroup namespaces, i.e. such that systemd can mount cgroupfs
within the container in a secure way.
The current discussion can be found here:
https://lkml.org/lkml/2015/1/7/150
As of now I have to drop all my systemd lxc guests and will replace them with
a non-systemd distro, which is very sad. :-(
Item 4 is somewhat dubious, since we're only changing some of the
fields in /proc/meminfo. It helps apps which blindly parse
/proc/meminfo to determine free system resources they can consume.
Those apps are broken even without containers being involved though,
since any application must expect to be placed inside a cgroup with
limited resources. Faking /proc/meminfo is a pretty limited workaround
that just delays the inevitable fixing of such apps.
You mean that tools like free(1) have to be patched to also query
memory limits from cgroupfs?
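
As a rough illustration of what such a fix could look like (a sketch
only, not how free(1) actually works): take MemTotal from /proc/meminfo
and cap it by the cgroup memory limit. The function names are
hypothetical, and the cgroup v1 path
/sys/fs/cgroup/memory/memory.limit_in_bytes is an assumption about where
the memory controller is mounted.

```python
def meminfo_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo content, converted to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like: "MemTotal:       16384256 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found in meminfo content")


def effective_memory_limit(meminfo_text, cgroup_limit_text):
    """Return the smaller of the host total and the cgroup limit, in bytes.

    An unlimited cgroup reports a very large number, so taking the
    minimum falls back to the host total in that case.
    """
    host_total = meminfo_total_bytes(meminfo_text)
    cgroup_limit = int(cgroup_limit_text.strip())
    return min(host_total, cgroup_limit)


if __name__ == "__main__":
    # Assumed cgroup v1 location; adjust for the actual mount layout.
    limit_path = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
    with open("/proc/meminfo") as mi, open(limit_path) as lim:
        print(effective_memory_limit(mi.read(), lim.read()))
```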
The patch that follows just removes the items 1 & 2, but I'm thinking
we should go further and remove items 3 & 4 too.
Changing 4 in particular is certainly classed as a guest ABI
change though, so it is not something distros may wish to see when
upgrading libvirt. There is scope to argue that 1-3 are guest ABI
changes too.
In full machine virt world, we deal with this using machine types.
eg each new KVM version introduces a new machine type which models
the guest ABI in a stable fashion. Guest machine types are fixed at
time of first deployment. So when libvirt / KVM is upgraded, existing
guests will not see any changes, but new guests will automatically
get the new machine type.
I'm thinking we might want to make use of this in LXC before making
these changes. eg introduce a new machine 'libvirt-lxc-1' to
represent the current guest mount setup and make sure all existing
guests get that machine type. Then introduce a new machine type
libvirt-lxc-2 that removes all this cruft, which new guests will
get by default.
Alternatively we could call them 'libvirt-lxc-compat-1' and
'libvirt-lxc-bare-1' to give a clearer indication of their
functional difference and version them separately in the future ?
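
The selection rule described above could be sketched roughly as follows.
This is purely illustrative: the function name, the feature table, and
its keys are hypothetical, and only the two machine type names and their
intended meaning come from the proposal.

```python
LEGACY_MACHINE = "libvirt-lxc-1"   # represents the current guest mount setup
DEFAULT_MACHINE = "libvirt-lxc-2"  # proposed type without the legacy mounts

# Hypothetical feature table mirroring items 1-4 from the list above.
MACHINE_FEATURES = {
    LEGACY_MACHINE: {
        "proc_sys_readonly": True,
        "sys_readonly": True,
        "cgroup_bind_mount": True,
        "fuse_meminfo": True,
    },
    DEFAULT_MACHINE: {
        "proc_sys_readonly": False,
        "sys_readonly": False,
        "cgroup_bind_mount": False,
        "fuse_meminfo": False,
    },
}


def resolve_machine_type(configured, is_existing_guest):
    """Pick the machine type for a guest.

    An explicitly configured type always wins. Guests deployed before
    machine types existed are pinned to the legacy type so their ABI
    does not change on upgrade; new guests get the current default.
    """
    if configured:
        return configured
    return LEGACY_MACHINE if is_existing_guest else DEFAULT_MACHINE
```

For example, resolve_machine_type(None, True) keeps an already deployed
guest on 'libvirt-lxc-1', while a freshly defined guest resolves to
'libvirt-lxc-2'.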
Can we have a new machine type which enforces user namespaces?
Regards,
Daniel
Daniel P. Berrange (1):
lxc: Stop mouning /proc and /sys read only
src/lxc/lxc_container.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
Acked-by: Richard Weinberger <richard(a)nod.at>
Thanks,
//richard