[libvirt] [PATCH] lxc: Cleaning up mount setup

We have historically done a number of things with LXC that are somewhat questionable in retrospect:

1. Mounted /proc/sys read-only, but then mounted /proc/sys/net/ipv* read-write again
2. Mounted /sys read-only
3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
4. FUSE mount on /proc/meminfo

Items 1 & 2 are pointless as they offer no security benefit either with or without user namespaces. Without userns it is always insecure, with userns it is always secure, no matter what the mount state is.

Item 3 is somewhat dubious, since /proc/self/cgroup paths for processes are now not visible at /sys/fs/cgroup. This really confuses systemd inside the container, making it create a broken layout.

Item 4 is somewhat dubious, since we're only changing some of the fields in /proc/meminfo. It helps apps which blindly parse /proc/meminfo to determine free system resources they can consume. Those apps are broken even without containers being involved, though, since any application must expect to be placed inside a cgroup with limited resources. Faking /proc/meminfo is a pretty limited workaround that just delays the inevitable fixing of such apps.

The patch that follows just removes items 1 & 2, but I'm thinking we should go further and remove items 3 & 4 too.

Changing 4 in particular is certainly classed as a guest ABI change, though, so is not something distros may wish to see when upgrading libvirt. There is scope to argue that 1-3 are guest ABI changes too.

In the full machine virt world we deal with this using machine types, e.g. each new KVM version introduces a new machine type which models the guest ABI in a stable fashion. Guest machine types are fixed at time of first deployment. So when libvirt / KVM is upgraded, existing guests will not see any changes, but new guests will automatically get the new machine type.

I'm thinking we might want to make use of this in LXC before making these changes, e.g. introduce a new machine type 'libvirt-lxc-1' to represent the current guest mount setup and make sure all existing guests get that machine type. Then introduce a new machine type 'libvirt-lxc-2' that removes all this cruft, which new guests will get by default.

Alternatively we could call them 'libvirt-lxc-compat-1' and 'libvirt-lxc-bare-1' to give a clearer indication of their functional difference, and version them separately in the future?

Regards, Daniel

Daniel P. Berrange (1):
  lxc: Stop mounting /proc and /sys read-only

 src/lxc/lxc_container.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

--
2.1.0

Mounting parts of /proc and /sys read-only provides no security without user namespaces, since root has privilege to remount them writable again. When user namespaces are enabled, it offers no security benefit, since the UID remapping already prevents write access to the correct areas.
---
 src/lxc/lxc_container.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 380d136..a764865 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -850,11 +850,18 @@ typedef struct {
 } virLXCBasicMountInfo;
 
 static const virLXCBasicMountInfo lxcBasicMounts[] = {
+    /*
+     * Leave these read-write. In non-user-namespace scenario, making them
+     * read-only provides no security since root can just remount them
+     * writeable again. In a user-namespace scenario, the UID/GID mappings
+     * will already prevent root from doing anything bad to files, so
+     * there's no gain to making them read-only
+     */
     { "proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
-    { "/proc/sys", "/proc/sys", NULL, MS_BIND|MS_RDONLY, false, false, false },
-    { "/.oldroot/proc/sys/net/ipv4", "/proc/sys/net/ipv4", NULL, MS_BIND, false, false, true },
-    { "/.oldroot/proc/sys/net/ipv6", "/proc/sys/net/ipv6", NULL, MS_BIND, false, false, true },
-    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, false, false, false },
+    { "sysfs", "/sys", "sysfs", MS_NOSUID|MS_NOEXEC|MS_NODEV, false, false, false },
+    /* These two are marked RDONLY not as security protection mechanism,
+       but to indicate to userspace that LSMs are not available inside
+       the container */
     { "securityfs", "/sys/kernel/security", "securityfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
 #if WITH_SELINUX
     { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", MS_NOSUID|MS_NOEXEC|MS_NODEV|MS_RDONLY, true, true, false },
--
2.1.0
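The commit message's claim can be demonstrated directly: without user namespaces, root in the container still holds CAP_SYS_ADMIN over its mount namespace and can simply clear MS_RDONLY again. The following is a minimal standalone sketch, not libvirt code (the target path and error handling are illustrative), of what container root could run against the old read-only /proc/sys bind mount:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

int main(void)
{
    /* MS_REMOUNT|MS_BIND alters the per-mountpoint flags of a bind
     * mount; omitting MS_RDONLY from the new flag set makes the mount
     * writable again. Any process with CAP_SYS_ADMIN in the mount
     * namespace may do this, which is exactly container root in the
     * non-user-namespace case. */
    if (mount(NULL, "/proc/sys", NULL, MS_REMOUNT | MS_BIND, NULL) < 0) {
        fprintf(stderr, "remount failed: %s\n", strerror(errno));
        return 1;
    }
    printf("/proc/sys is writable again\n");
    return 0;
}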

-----Original Message-----
From: Daniel P. Berrange [mailto:berrange@redhat.com]
Sent: Thursday, January 08, 2015 9:03 PM
To: libvir-list@redhat.com
Cc: Richard Weinberger; Chen, Hanxiao/陈 晗霄; Daniel P. Berrange
Subject: [PATCH] lxc: Stop mounting /proc and /sys read-only
Mounting parts of /proc and /sys read-only provides no security without user namespaces, since root has privilege to remount them writable again. When user namespaces are enabled, it offers no security benefit, since the UID remapping already prevents write access to the correct areas.
---
 src/lxc/lxc_container.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)
ACK.

We also need to do some cleanups in lxcContainerMountBasicFS, and also for commit ba9b7252ea8d87dfa217fb11dc5dadc039176807.

Thanks,
- Chen

On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
We have historically done a number of things with LXC that are somewhat questionable in retrospect:

1. Mounted /proc/sys read-only, but then mounted /proc/sys/net/ipv* read-write again
2. Mounted /sys read-only
3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
4. FUSE mount on /proc/meminfo
Items 1 & 2 are pointless as they offer no security benefit either with or without user namespaces. Without userns it is always insecure, with userns it is always secure, no matter what the mount state is.
I agree. Thanks a lot for addressing this, Daniel!
Item 3 is somewhat dubious, since /proc/self/cgroup paths for processes are now not visible at /sys/fs/cgroup. This really confuses systemd inside the container, making it create a broken layout.
The question is, how to support systemd in containers? As of now I'm not aware of a working concept. With current libvirt it kind of works, but recently I found a very nasty issue. See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html

Maybe it works with cgroup namespaces, i.e. such that systemd can mount cgroupfs within the container in a secure way. The current discussion can be found here: https://lkml.org/lkml/2015/1/7/150

As of now I have to drop all my systemd LXC guests and replace them with a non-systemd distro, which is very sad. :-(
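For reference, the cgroup namespace primitive discussed in that LKML thread did not exist when this thread was written; it was merged later, in Linux 4.6, as CLONE_NEWCGROUP. A rough sketch of the resulting interface (hypothetical usage; needs CAP_SYS_ADMIN and a glibc that defines the flag):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* After unshare(CLONE_NEWCGROUP), /proc/self/cgroup reports paths
     * relative to the cgroup the process occupied at unshare() time,
     * so a container's cgroupfs mount can match what its processes
     * see - the property systemd needs for a sane layout. */
    if (unshare(CLONE_NEWCGROUP) < 0) {
        perror("unshare(CLONE_NEWCGROUP)");
        return EXIT_FAILURE;
    }
    /* Inside the namespace this shows container-relative paths
     * (e.g. just "/"). */
    return system("cat /proc/self/cgroup") == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}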
Item 4 is somewhat dubious, since we're only changing some of the fields in /proc/meminfo. It helps apps which blindly parse /proc/meminfo to determine free system resources they can consume. Those apps are broken even without containers being involved, though, since any application must expect to be placed inside a cgroup with limited resources. Faking /proc/meminfo is a pretty limited workaround that just delays the inevitable fixing of such apps.

You mean that tools like free(1) have to be patched to also query memory limits from cgroupfs?

The patch that follows just removes items 1 & 2, but I'm thinking we should go further and remove items 3 & 4 too.

Changing 4 in particular is certainly classed as a guest ABI change, though, so is not something distros may wish to see when upgrading libvirt. There is scope to argue that 1-3 are guest ABI changes too.

In the full machine virt world we deal with this using machine types, e.g. each new KVM version introduces a new machine type which models the guest ABI in a stable fashion. Guest machine types are fixed at time of first deployment. So when libvirt / KVM is upgraded, existing guests will not see any changes, but new guests will automatically get the new machine type.

I'm thinking we might want to make use of this in LXC before making these changes, e.g. introduce a new machine type 'libvirt-lxc-1' to represent the current guest mount setup and make sure all existing guests get that machine type. Then introduce a new machine type 'libvirt-lxc-2' that removes all this cruft, which new guests will get by default.

Alternatively we could call them 'libvirt-lxc-compat-1' and 'libvirt-lxc-bare-1' to give a clearer indication of their functional difference, and version them separately in the future?
Can we have a new machine type which enforces user namespaces?
Regards, Daniel
Daniel P. Berrange (1):
  lxc: Stop mounting /proc and /sys read-only

 src/lxc/lxc_container.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

Acked-by: Richard Weinberger <richard@nod.at>

Thanks,
//richard

On Thu, Jan 08, 2015 at 02:36:36PM +0100, Richard Weinberger wrote:
On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
We have historically done a number of things with LXC that are somewhat questionable in retrospect:

1. Mounted /proc/sys read-only, but then mounted /proc/sys/net/ipv* read-write again
2. Mounted /sys read-only
3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
4. FUSE mount on /proc/meminfo
Items 1 & 2 are pointless as they offer no security benefit either with or without user namespaces. Without userns it is always insecure, with userns it is always secure, no matter what the mount state is.
I agree. Thanks a lot for addressing this, Daniel!
Item 3 is somewhat dubious, since /proc/self/cgroup paths for processes are now not visible at /sys/fs/cgroup. This really confuses systemd inside the container, making it create a broken layout.
The question is, how to support systemd in containers?
As of now I'm not aware of a working concept. With current libvirt it kind of works, but recently I found a very nasty issue. See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html

That reply from Lennart suggests systemd should pretty much work, albeit in a hacky way. I've not done much in anger with systemd in containers, but I have found it sufficient for application containers - i.e. not full OS containers with interactive sessions.

Maybe it works with cgroup namespaces, i.e. such that systemd can mount cgroupfs within the container in a secure way. The current discussion can be found here: https://lkml.org/lkml/2015/1/7/150

As of now I have to drop all my systemd LXC guests and replace them with a non-systemd distro, which is very sad. :-(

Item 4 is somewhat dubious, since we're only changing some of the fields in /proc/meminfo. It helps apps which blindly parse /proc/meminfo to determine free system resources they can consume. Those apps are broken even without containers being involved, though, since any application must expect to be placed inside a cgroup with limited resources. Faking /proc/meminfo is a pretty limited workaround that just delays the inevitable fixing of such apps.

You mean that tools like free(1) have to be patched to also query memory limits from cgroupfs?

Not necessarily. The 'free' tool is said to "Display amount of free and used memory in the system", so it is arguably correct that it reports /proc/meminfo of the host as a whole. What is broken are applications that invoke 'free' and then believe that the values it reports correspond to what the application is able to use, i.e. the applications are not taking into account that they might not have the ability to use the entire system's resources due to cgroups or containers or both.
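For illustration, a container-aware application would consult the memory cgroup confining it rather than trusting /proc/meminfo. A minimal sketch, assuming the cgroup v1 memory controller at its conventional mount point and a process sitting in the root memory cgroup; real code would first parse /proc/self/cgroup to find its own cgroup directory:

#include <stdio.h>

int main(void)
{
    unsigned long long limit, memtotal;
    char line[256];
    FILE *f;

    /* The cgroup's limit is what this process may actually use.
     * Path assumes cgroup v1; adjust per /proc/self/cgroup in
     * real code. */
    f = fopen("/sys/fs/cgroup/memory/memory.limit_in_bytes", "r");
    if (f) {
        if (fscanf(f, "%llu", &limit) == 1)
            printf("cgroup memory limit: %llu bytes\n", limit);
        fclose(f);
    }

    /* /proc/meminfo describes the host as a whole - correct for
     * 'free', misleading as an estimate of per-app headroom. */
    f = fopen("/proc/meminfo", "r");
    if (f) {
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "MemTotal: %llu kB", &memtotal) == 1) {
                printf("host MemTotal: %llu kB\n", memtotal);
                break;
            }
        }
        fclose(f);
    }
    return 0;
}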
The patch that follows just removes items 1 & 2, but I'm thinking we should go further and remove items 3 & 4 too.

Changing 4 in particular is certainly classed as a guest ABI change, though, so is not something distros may wish to see when upgrading libvirt. There is scope to argue that 1-3 are guest ABI changes too.

In the full machine virt world we deal with this using machine types, e.g. each new KVM version introduces a new machine type which models the guest ABI in a stable fashion. Guest machine types are fixed at time of first deployment. So when libvirt / KVM is upgraded, existing guests will not see any changes, but new guests will automatically get the new machine type.

I'm thinking we might want to make use of this in LXC before making these changes, e.g. introduce a new machine type 'libvirt-lxc-1' to represent the current guest mount setup and make sure all existing guests get that machine type. Then introduce a new machine type 'libvirt-lxc-2' that removes all this cruft, which new guests will get by default.

Alternatively we could call them 'libvirt-lxc-compat-1' and 'libvirt-lxc-bare-1' to give a clearer indication of their functional difference, and version them separately in the future?
Can we have a new machine type which enforces user namespaces?
Hmm, I'm not sure that would work. Not least because we need a way to assign the UID/GID mapping, and the filesystems used with the container need to have the right UID/GID permissions set up. IOW I don't think user ns is something we can transparently / automatically turn on.

Regards, Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
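To see why the mapping cannot be guessed automatically, here is roughly what "turning on user ns" means at the kernel interface level: something has to pick the host UID/GID range, write it into the new namespace's uid_map (and gid_map), and own the container rootfs to match. A bare sketch with hypothetical numbers (run as root; gid_map handling and error cleanup omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static int child(void *arg)
{
    (void)arg;
    sleep(1);                    /* crude: wait for parent to write the map */
    printf("in userns: uid=%d\n", (int)getuid());   /* 0 once mapped */
    return 0;
}

int main(void)
{
    static char stack[256 * 1024];
    char path[64];
    FILE *f;

    pid_t pid = clone(child, stack + sizeof(stack),
                      CLONE_NEWUSER | SIGCHLD, NULL);
    if (pid < 0) {
        perror("clone");
        return EXIT_FAILURE;
    }

    /* Map container UID 0 onto host UID 100000, 65536 IDs wide. These
     * numbers are precisely the policy decision that cannot be made
     * automatically: the rootfs must be chowned to match the range. */
    snprintf(path, sizeof(path), "/proc/%d/uid_map", (int)pid);
    f = fopen(path, "w");
    if (!f || fprintf(f, "0 100000 65536\n") < 0 || fclose(f) != 0) {
        perror("uid_map");
        return EXIT_FAILURE;
    }

    waitpid(pid, NULL, 0);
    return EXIT_SUCCESS;
}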

On 08.01.2015 at 14:45, Daniel P. Berrange wrote:
On Thu, Jan 08, 2015 at 02:36:36PM +0100, Richard Weinberger wrote:
On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
We have historically done a number of things with LXC that are somewhat questionable in retrospect:

1. Mounted /proc/sys read-only, but then mounted /proc/sys/net/ipv* read-write again
2. Mounted /sys read-only
3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
4. FUSE mount on /proc/meminfo
Items 1 & 2 are pointless as they offer no security benefit either with or without user namespaces. Without userns it is always insecure, with userns it is always secure, no matter what the mount state is.
I agree. Thanks a lot for addressing this, Daniel!
Item 3 is somewhat dubious, since /proc/self/cgroup paths for processes are now not visible at /sys/fs/cgroup. This really confuses systemd inside the container, making it create a broken layout.
The question is, how to support systemd in containers?
As of now I'm not aware of a working concept. With current libvirt it kind of works, but recently I found a very nasty issue. See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html
That reply from Lennart suggests systemd should pretty much work, albeit in a hacky way.
What hack do you mean? *confused*

I've not done much in anger with systemd in containers, but I have found it sufficient for application containers - i.e. not full OS containers with interactive sessions.

My use case is different. Most of the time I need at least an init. And if the distro is systemd-based....

Maybe it works with cgroup namespaces, i.e. such that systemd can mount cgroupfs within the container in a secure way. The current discussion can be found here: https://lkml.org/lkml/2015/1/7/150

As of now I have to drop all my systemd LXC guests and replace them with a non-systemd distro, which is very sad. :-(

Item 4 is somewhat dubious, since we're only changing some of the fields in /proc/meminfo. It helps apps which blindly parse /proc/meminfo to determine free system resources they can consume. Those apps are broken even without containers being involved, though, since any application must expect to be placed inside a cgroup with limited resources. Faking /proc/meminfo is a pretty limited workaround that just delays the inevitable fixing of such apps.

You mean that tools like free(1) have to be patched to also query memory limits from cgroupfs?

Not necessarily. The 'free' tool is said to
"Display amount of free and used memory in the system"
so it is arguably correct that it reports /proc/meminfo of the host as a whole.
What is broken are applications that invoke 'free' and then believe that the values it reports correspond to what the application is able to use, i.e. the applications are not taking into account that they might not have the ability to use the entire system's resources due to cgroups or containers or both.

The patch that follows just removes items 1 & 2, but I'm thinking we should go further and remove items 3 & 4 too.

Changing 4 in particular is certainly classed as a guest ABI change, though, so is not something distros may wish to see when upgrading libvirt. There is scope to argue that 1-3 are guest ABI changes too.

In the full machine virt world we deal with this using machine types, e.g. each new KVM version introduces a new machine type which models the guest ABI in a stable fashion. Guest machine types are fixed at time of first deployment. So when libvirt / KVM is upgraded, existing guests will not see any changes, but new guests will automatically get the new machine type.

I'm thinking we might want to make use of this in LXC before making these changes, e.g. introduce a new machine type 'libvirt-lxc-1' to represent the current guest mount setup and make sure all existing guests get that machine type. Then introduce a new machine type 'libvirt-lxc-2' that removes all this cruft, which new guests will get by default.

Alternatively we could call them 'libvirt-lxc-compat-1' and 'libvirt-lxc-bare-1' to give a clearer indication of their functional difference, and version them separately in the future?
Can we have a new machine type which enforces user namespaces?
Hmm, I'm not sure that would work. Not least because we need a way to assign the UID/GID mapping, and the filesystems used with the container need to have the right UID/GID permissions set up. IOW I don't think user ns is something we can transparently / automatically turn on.

Yeah, but we have to warn the user that she is doing something insecure if no mappings are set up.

Thanks,
//richard

On Thu, Jan 08, 2015 at 03:02:59PM +0100, Richard Weinberger wrote:
On 08.01.2015 at 14:45, Daniel P. Berrange wrote:
On Thu, Jan 08, 2015 at 02:36:36PM +0100, Richard Weinberger wrote:
On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
We have historically done a number of things with LXC that are somewhat questionable in retrospect:

1. Mounted /proc/sys read-only, but then mounted /proc/sys/net/ipv* read-write again
2. Mounted /sys read-only
3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
4. FUSE mount on /proc/meminfo
Items 1 & 2 are pointless as they offer no security benefit either with or without user namespaces. Without userns it is always insecure, with userns it is always secure, no matter what the mount state is.
I agree. Thanks a lot for addressing this, Daniel!
Item 3 is somewhat dubious, since /proc/self/cgroup paths for processes are now not visible at /sys/fs/cgroup. This really confuses systemd inside the container, making it create a broken layout.
The question is, how to support systemd in containers?
As of now I'm not aware of a working concept. With current libvirt it kind of works, but recently I found a very nasty issue. See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html
That reply from Lennart suggests systemd should pretty much work, albeit in a hacky way.
What hack do you mean?

Lennart's reply detailing their workaround hacks:

"Our current strategy for still being able to clean everything up is this: [snip details] Complex? Awful? Disgusting? Yes, absolutely. But as far as I can see it should actually be good enough to all cases I ran into."
I've not done much in anger with systemd in containers, but I have found it sufficient for application containers - i.e. not full OS containers with interactive sessions.

My use case is different. Most of the time I need at least an init. And if the distro is systemd-based....

When I said application containers there, I meant running with systemd, but set up so it only runs a specific set of unit files, enough to launch the desired app, rather than running the full default Fedora OS unit set.
Can we have a new machine type which enforces user namespaces?
Hmm, I'm not sure that would work. Not least because we need a way to assign the UID/GID mapping, and the filesystems used with the container need to have the right UID/GID permissions set up. IOW I don't think user ns is something we can transparently / automatically turn on.

Yeah, but we have to warn the user that she is doing something insecure if no mappings are set up.

Ultimately I think that's a docs problem, or something that a higher-level app needs to deal with, e.g. OpenStack should set up LXC such that user namespaces are unconditionally enabled all the time, even if that's not the case in libvirt itself. OpenStack manages the whole machine, so it has enough context to do the setup that libvirt cannot do.

Regards, Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 08.01.2015 at 15:06, Daniel P. Berrange wrote:
On Thu, Jan 08, 2015 at 03:02:59PM +0100, Richard Weinberger wrote:
On 08.01.2015 at 14:45, Daniel P. Berrange wrote:
On Thu, Jan 08, 2015 at 02:36:36PM +0100, Richard Weinberger wrote:
On 08.01.2015 at 14:02, Daniel P. Berrange wrote:
We have historically done a number of things with LXC that are somewhat questionable in retrospect:

1. Mounted /proc/sys read-only, but then mounted /proc/sys/net/ipv* read-write again
2. Mounted /sys read-only
3. Mounted /sys/fs/cgroup/NNN/the/guest/dir to /sys/fs/cgroup/NNN
4. FUSE mount on /proc/meminfo
Items 1 & 2 are pointless as they offer no security benefit either with or without user namespaces. Without userns it is always insecure, with userns it is always secure, no matter what the mount state is.
I agree. Thanks a lot for addressing this, Daniel!
Item 3 is somewhat dubious, since /proc/self/cgroup paths for processes are now not visible at /sys/fs/cgroup. This really confuses systemd inside the container, making it create a broken layout.
The question is, how to support systemd in containers?
As of now I'm not aware of a working concept. With current libvirt it kind of works, but recently I found a very nasty issue. See: https://www.redhat.com/archives/libvir-list/2014-November/msg01090.html
That reply from Lennart suggests systemd should pretty much work, albeit in a hacky way.
What hack do you mean?

Lennart's reply detailing their workaround hacks:

Oh yes. But these do not work, as I've stated in the mail. My containers show thousands of orphaned login sessions, which renders them unusable after some time.

My use case is different. Most of the time I need at least an init. And if the distro is systemd-based....

Yeah, but we have to warn the user that she is doing something insecure if no mappings are set up.

Ultimately I think that's a docs problem, or something that a higher-level app needs to deal with, e.g. OpenStack should set up LXC such that user namespaces are unconditionally enabled all the time, even if that's not the case in libvirt itself. OpenStack manages the whole machine, so it has enough context to do the setup that libvirt cannot do.

I don't run OpenStack but I tend to agree. :-)

Thanks,
//richard