[libvirt] [RFC PATCH 1/2] LXC: Drop capabilities only if we're not within a user namespace

Dropping capabilities within a user namespace makes no sense because any
uid 0 process will regain all caps upon execve().

Signed-off-by: Richard Weinberger <richard@nod.at>
---
 src/lxc/lxc_container.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 958e20d..4f00420 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -1896,6 +1896,15 @@ static int lxcContainerDropCapabilities(bool keepReboot ATTRIBUTE_UNUSED)
     return 0;
 }
 
+static int userns_supported(void)
+{
+    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
+}
+
+static int userns_required(virDomainDefPtr def)
+{
+    return def->idmap.uidmap && def->idmap.gidmap;
+}
 
 /**
  * lxcContainerChild:
@@ -1992,7 +2001,7 @@ static int lxcContainerChild(void *data)
     }
 
     /* drop a set of root capabilities */
-    if (lxcContainerDropCapabilities(!!hasReboot) < 0)
+    if (!userns_required(vmDef) && lxcContainerDropCapabilities(!!hasReboot) < 0)
         goto cleanup;
 
     if (lxcContainerSendContinue(argv->handshakefd) < 0) {
@@ -2025,16 +2034,6 @@ cleanup:
     return ret;
 }
 
-static int userns_supported(void)
-{
-    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
-}
-
-static int userns_required(virDomainDefPtr def)
-{
-    return def->idmap.uidmap && def->idmap.gidmap;
-}
-
 virArch lxcContainerGetAlt32bitArch(virArch arch)
 {
     /* Any Linux 64bit arch which has a 32bit
-- 
1.8.1.4

Within a user namespace root can remount these filesystems at any
time rw.
Create these mappings only if we're not playing with user namespaces.

Signed-off-by: Richard Weinberger <richard@nod.at>
---
 src/lxc/lxc_container.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 4f00420..a003ec8 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -682,8 +682,17 @@ err:
     return ret;
 }
 
+static int userns_supported(void)
+{
+    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
+}
 
-static int lxcContainerMountBasicFS(void)
+static int userns_required(virDomainDefPtr def)
+{
+    return def->idmap.uidmap && def->idmap.gidmap;
+}
+
+static int lxcContainerMountBasicFS(virDomainDefPtr vmDef)
 {
     const struct {
         const char *src;
@@ -691,6 +700,7 @@ static int lxcContainerMountBasicFS(void)
         const char *type;
         const char *opts;
         int mflags;
+        bool paranoia;
     } mnts[] = {
         /* When we want to make a bind mount readonly, for unknown reasons,
          * it is currently necessary to bind it once, and then remount the
@@ -698,14 +708,14 @@ static int lxcContainerMountBasicFS(void)
          * mount point in the main OS becomes readonly too which is not what
          * we want. Hence some things have two entries here.
          */
-        { "proc", "/proc", "proc", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV },
-        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND },
-        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY },
-        { "sysfs", "/sys", "sysfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV },
-        { "sysfs", "/sys", "sysfs", NULL, MS_BIND|MS_REMOUNT|MS_RDONLY },
+        { "proc", "/proc", "proc", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, false },
+        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND, true },
+        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, true },
+        { "sysfs", "/sys", "sysfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, false },
+        { "sysfs", "/sys", "sysfs", NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, true },
 #if WITH_SELINUX
-        { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV },
-        { SELINUX_MOUNT, SELINUX_MOUNT, NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY },
+        { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, false },
+        { SELINUX_MOUNT, SELINUX_MOUNT, NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, true },
 #endif
     };
     int i, rc = -1;
@@ -720,6 +730,10 @@ static int lxcContainerMountBasicFS(void)
 
         srcpath = mnts[i].src;
 
+        /* Skip ro overlay mounts if we build a userns as root can remount it rw at any time */
+        if (userns_required(vmDef) && mnts[i].paranoia)
+            continue;
+
         /* Skip if mount doesn't exist in source */
         if ((srcpath[0] == '/') &&
             (access(srcpath, R_OK) < 0))
@@ -1780,7 +1794,7 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef,
         goto cleanup;
 
     /* Mounts the core /proc, /sys, etc filesystems */
-    if (lxcContainerMountBasicFS() < 0)
+    if (lxcContainerMountBasicFS(vmDef) < 0)
         goto cleanup;
 
     /* Mounts /proc/meminfo etc sysinfo */
@@ -1896,16 +1910,6 @@ static int lxcContainerDropCapabilities(bool keepReboot ATTRIBUTE_UNUSED)
     return 0;
 }
 
-static int userns_supported(void)
-{
-    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
-}
-
-static int userns_required(virDomainDefPtr def)
-{
-    return def->idmap.uidmap && def->idmap.gidmap;
-}
-
 /**
  * lxcContainerChild:
  * @data: pointer to container arguments
-- 
1.8.1.4

On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.

This is a problem with the way we're initializing mounts in the user namespace. We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.

IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
> On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
>> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
> This is a problem with the way we're initializing mounts in the user namespace.

This problem exists even if libvirt lxc doesn't support user namespaces.

> We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
> IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.

Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-initial user namespace anyway.

So maybe we should fix this problem with SELinux.

On Wed, Jun 26, 2013 at 10:26:10AM +0800, Gao feng wrote:
> On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
>> On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
>>> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
>> This is a problem with the way we're initializing mounts in the user namespace.
> This problem exists even if libvirt lxc doesn't support user namespaces.

Yes, and this is a problem that user namespaces are intended to solve.

>> We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
>> IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.
> Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-initial user namespace anyway.

Merely being able to unmount is sufficient to exploit the host. Consider that the container was configured with the following mapping

  /                        -> /
  /export/mycontainer/home -> /home

Now, if the container admin can umount /home, then they can now see the home directory contents of the host. At the least this is likely to be information leakage, and if any of the host home directories have UIDs that overlap with those assigned to the container ID map, you have a potentially exploitable situation.

Hence we need to ensure that the container cannot unmount or remount anything set up by libvirt. AFAICT, this means that all the mounts libvirt does must be performed in a separate user namespace to that which the container will eventually run in.

> So maybe we should fix this problem with SELinux.

User namespaces are intended to allow for secure containers without the need to run SELinux.

Daniel

On 06/26/2013 05:38 PM, Daniel P. Berrange wrote:
> On Wed, Jun 26, 2013 at 10:26:10AM +0800, Gao feng wrote:
>> On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
>>> On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
>>>> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
>>> This is a problem with the way we're initializing mounts in the user namespace.
>> This problem exists even if libvirt lxc doesn't support user namespaces.
> Yes, and this is a problem that user namespaces are intended to solve.
>>> We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
>>> IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.
>> Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-initial user namespace anyway.
> Merely being able to unmount is sufficient to exploit the host. Consider that the container was configured with the following mapping
>
>   /                        -> /
>   /export/mycontainer/home -> /home
>
> Now, if the container admin can umount /home, then they can now see the home directory contents of the host. At the least this is likely to be information leakage, and if any of the host home directories have UIDs that overlap with those assigned to the container ID map, you have a potentially exploitable situation.
> Hence we need to ensure that the container cannot unmount or remount anything set up by libvirt. AFAICT, this means that all the mounts libvirt does must be performed in a separate user namespace to that which the container will eventually run in.

Libvirt mounts something for the container in one user namespace, and then libvirt calls unshare to create a new user namespace and start the init task of the container.

Yes, the users in the container can't mount/unmount/remount any filesystem. But they can call unshare to create a new mount namespace, and they will have the rights to mount/unmount/remount in this newly created mount namespace. They can still umount /home to see the home directory contents of the host.

I haven't tested this yet, but I think this problem exists. User namespaces can't do this job well.

>> So maybe we should fix this problem with SELinux.
> User namespaces are intended to allow for secure containers without the need to run SELinux.
> Daniel

On Wed, Jun 26, 2013 at 05:56:19PM +0800, Gao feng wrote:
> On 06/26/2013 05:38 PM, Daniel P. Berrange wrote:
>> On Wed, Jun 26, 2013 at 10:26:10AM +0800, Gao feng wrote:
>>> On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
>>>> On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
>>>>> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
>>>> This is a problem with the way we're initializing mounts in the user namespace.
>>> This problem exists even if libvirt lxc doesn't support user namespaces.
>> Yes, and this is a problem that user namespaces are intended to solve.
>>>> We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
>>>> IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.
>>> Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-initial user namespace anyway.
>> Merely being able to unmount is sufficient to exploit the host. Consider that the container was configured with the following mapping
>>
>>   /                        -> /
>>   /export/mycontainer/home -> /home
>>
>> Now, if the container admin can umount /home, then they can now see the home directory contents of the host. At the least this is likely to be information leakage, and if any of the host home directories have UIDs that overlap with those assigned to the container ID map, you have a potentially exploitable situation.
>> Hence we need to ensure that the container cannot unmount or remount anything set up by libvirt. AFAICT, this means that all the mounts libvirt does must be performed in a separate user namespace to that which the container will eventually run in.
> Libvirt mounts something for the container in one user namespace, and then libvirt calls unshare to create a new user namespace and start the init task of the container.
> Yes, the users in the container can't mount/unmount/remount any filesystem. But they can call unshare to create a new mount namespace, and they will have the rights to mount/unmount/remount in this newly created mount namespace. They can still umount /home to see the home directory contents of the host.

An existing filesystem mount can only be remounted/unmounted by the (user ID, user namespace) that originally mounted it. So even if you start a new mount namespace, you cannot unmount stuff set up by the parent user namespace.

  # unshare --mount --user /bin/sh
  sh-4.2$ umount /sys/kernel/debug
  umount: /sys/kernel/debug: Invalid argument

Regards,
Daniel

On 06/26/2013 07:01 PM, Daniel P. Berrange wrote:
> On Wed, Jun 26, 2013 at 05:56:19PM +0800, Gao feng wrote:
>> On 06/26/2013 05:38 PM, Daniel P. Berrange wrote:
>>> On Wed, Jun 26, 2013 at 10:26:10AM +0800, Gao feng wrote:
>>>> On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
>>>>> On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
>>>>>> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
>>>>> This is a problem with the way we're initializing mounts in the user namespace.
>>>> This problem exists even if libvirt lxc doesn't support user namespaces.
>>> Yes, and this is a problem that user namespaces are intended to solve.
>>>>> We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
>>>>> IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.
>>>> Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-initial user namespace anyway.
>>> Merely being able to unmount is sufficient to exploit the host. Consider that the container was configured with the following mapping
>>>
>>>   /                        -> /
>>>   /export/mycontainer/home -> /home
>>>
>>> Now, if the container admin can umount /home, then they can now see the home directory contents of the host. At the least this is likely to be information leakage, and if any of the host home directories have UIDs that overlap with those assigned to the container ID map, you have a potentially exploitable situation.
>>> Hence we need to ensure that the container cannot unmount or remount anything set up by libvirt. AFAICT, this means that all the mounts libvirt does must be performed in a separate user namespace to that which the container will eventually run in.
>> Libvirt mounts something for the container in one user namespace, and then libvirt calls unshare to create a new user namespace and start the init task of the container.
>> Yes, the users in the container can't mount/unmount/remount any filesystem. But they can call unshare to create a new mount namespace, and they will have the rights to mount/unmount/remount in this newly created mount namespace. They can still umount /home to see the home directory contents of the host.
> An existing filesystem mount can only be remounted/unmounted by the (user ID, user namespace) that originally mounted it. So even if you start a new mount namespace, you cannot unmount stuff set up by the parent user namespace.

Please also set up the uid_map/gid_map for the unshared user namespace. Even in a container, the user has the rights to set up these two files.

>   # unshare --mount --user /bin/sh
>   sh-4.2$ umount /sys/kernel/debug
>   umount: /sys/kernel/debug: Invalid argument

in terminal one

  $ id
  uid=1000(gaofeng) gid=1000(gaofeng) groups=1000(gaofeng)
  $ ./unshare --mount --user /bin/sh
  sh-4.2$ echo $$
  17110
  sh-4.2$

in the other terminal, set up the id map for the new userns.

  $ echo 0 1000 1 > /proc/17110/uid_map
  $ echo 0 1000 1 > /proc/17110/gid_map

and then in terminal one

  sh-4.2$ umount -l /home/

On Thu, Jun 27, 2013 at 08:56:25AM +0800, Gao feng wrote:
> On 06/26/2013 07:01 PM, Daniel P. Berrange wrote:
>> On Wed, Jun 26, 2013 at 05:56:19PM +0800, Gao feng wrote:
>>> On 06/26/2013 05:38 PM, Daniel P. Berrange wrote:
>>>> On Wed, Jun 26, 2013 at 10:26:10AM +0800, Gao feng wrote:
>>>>> On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
>>>>>> On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
>>>>>>> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
>>>>>> This is a problem with the way we're initializing mounts in the user namespace.
>>>>> This problem exists even if libvirt lxc doesn't support user namespaces.
>>>> Yes, and this is a problem that user namespaces are intended to solve.
>>>>>> We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
>>>>>> IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.
>>>>> Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-initial user namespace anyway.
>>>> Merely being able to unmount is sufficient to exploit the host. Consider that the container was configured with the following mapping
>>>>
>>>>   /                        -> /
>>>>   /export/mycontainer/home -> /home
>>>>
>>>> Now, if the container admin can umount /home, then they can now see the home directory contents of the host. At the least this is likely to be information leakage, and if any of the host home directories have UIDs that overlap with those assigned to the container ID map, you have a potentially exploitable situation.
>>>> Hence we need to ensure that the container cannot unmount or remount anything set up by libvirt. AFAICT, this means that all the mounts libvirt does must be performed in a separate user namespace to that which the container will eventually run in.
>>> Libvirt mounts something for the container in one user namespace, and then libvirt calls unshare to create a new user namespace and start the init task of the container.
>>> Yes, the users in the container can't mount/unmount/remount any filesystem. But they can call unshare to create a new mount namespace, and they will have the rights to mount/unmount/remount in this newly created mount namespace. They can still umount /home to see the home directory contents of the host.
>> An existing filesystem mount can only be remounted/unmounted by the (user ID, user namespace) that originally mounted it. So even if you start a new mount namespace, you cannot unmount stuff set up by the parent user namespace.
> Please also set up the uid_map/gid_map for the unshared user namespace. Even in a container, the user has the rights to set up these two files.
>>   # unshare --mount --user /bin/sh
>>   sh-4.2$ umount /sys/kernel/debug
>>   umount: /sys/kernel/debug: Invalid argument
> in terminal one
>
>   $ id
>   uid=1000(gaofeng) gid=1000(gaofeng) groups=1000(gaofeng)
>   $ ./unshare --mount --user /bin/sh
>   sh-4.2$ echo $$
>   17110
>   sh-4.2$
>
> in the other terminal, set up the id map for the new userns.
>
>   $ echo 0 1000 1 > /proc/17110/uid_map
>   $ echo 0 1000 1 > /proc/17110/gid_map
>
> and then in terminal one
>
>   sh-4.2$ umount -l /home/

Oh, hmm, forgot about the uid mapping. I thought the capabilities would be allowing me to unmount regardless.

Well, given that we're at rc2 now & I'm still unclear about how some aspects of the userns setup are working, I'm afraid we'll have to wait until 1.1.1 for the userns LXC code to merge. I'll aim to do it next week, so that we have plenty of time for further testing before the 1.1.1 release.

Daniel

On 06/28/2013 06:17 PM, Daniel P. Berrange wrote:
> On Thu, Jun 27, 2013 at 08:56:25AM +0800, Gao feng wrote:
>> On 06/26/2013 07:01 PM, Daniel P. Berrange wrote:
>>> On Wed, Jun 26, 2013 at 05:56:19PM +0800, Gao feng wrote:
>>>> On 06/26/2013 05:38 PM, Daniel P. Berrange wrote:
>>>>> On Wed, Jun 26, 2013 at 10:26:10AM +0800, Gao feng wrote:
>>>>>> On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
>>>>>>> On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
>>>>>>>> Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
>>>>>>> This is a problem with the way we're initializing mounts in the user namespace.
>>>>>> This problem exists even if libvirt lxc doesn't support user namespaces.
>>>>> Yes, and this is a problem that user namespaces are intended to solve.
>>>>>>> We need to ensure that the initial mounts set up by libvirt can't be changed by the admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
>>>>>>> IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.
>>>>>> Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-initial user namespace anyway.
>>>>> Merely being able to unmount is sufficient to exploit the host. Consider that the container was configured with the following mapping
>>>>>
>>>>>   /                        -> /
>>>>>   /export/mycontainer/home -> /home
>>>>>
>>>>> Now, if the container admin can umount /home, then they can now see the home directory contents of the host. At the least this is likely to be information leakage, and if any of the host home directories have UIDs that overlap with those assigned to the container ID map, you have a potentially exploitable situation.
>>>>> Hence we need to ensure that the container cannot unmount or remount anything set up by libvirt. AFAICT, this means that all the mounts libvirt does must be performed in a separate user namespace to that which the container will eventually run in.
>>>> Libvirt mounts something for the container in one user namespace, and then libvirt calls unshare to create a new user namespace and start the init task of the container.
>>>> Yes, the users in the container can't mount/unmount/remount any filesystem. But they can call unshare to create a new mount namespace, and they will have the rights to mount/unmount/remount in this newly created mount namespace. They can still umount /home to see the home directory contents of the host.
>>> An existing filesystem mount can only be remounted/unmounted by the (user ID, user namespace) that originally mounted it. So even if you start a new mount namespace, you cannot unmount stuff set up by the parent user namespace.
>> Please also set up the uid_map/gid_map for the unshared user namespace. Even in a container, the user has the rights to set up these two files.
>>>   # unshare --mount --user /bin/sh
>>>   sh-4.2$ umount /sys/kernel/debug
>>>   umount: /sys/kernel/debug: Invalid argument
>> in terminal one
>>
>>   $ id
>>   uid=1000(gaofeng) gid=1000(gaofeng) groups=1000(gaofeng)
>>   $ ./unshare --mount --user /bin/sh
>>   sh-4.2$ echo $$
>>   17110
>>   sh-4.2$
>>
>> in the other terminal, set up the id map for the new userns.
>>
>>   $ echo 0 1000 1 > /proc/17110/uid_map
>>   $ echo 0 1000 1 > /proc/17110/gid_map
>>
>> and then in terminal one
>>
>>   sh-4.2$ umount -l /home/
> Oh, hmm, forgot about the uid mapping. I thought the capabilities would be allowing me to unmount regardless.
> Well, given that we're at rc2 now & I'm still unclear about how some aspects of the userns setup are working, I'm afraid we'll have to wait until 1.1.1 for the userns LXC code to merge. I'll aim to do it next week, so that we have plenty of time for further testing before the 1.1.1 release.

Ok, I think Richard had tested the userns support. Hi Richard, can you give me your ack or tested-by?

Thanks!

On 01.07.2013 04:26, Gao feng wrote:
>> Well, given that we're at rc2 now & I'm still unclear about how some aspects of the userns setup are working, I'm afraid we'll have to wait until 1.1.1 for the userns LXC code to merge. I'll aim to do it next week, so that we have plenty of time for further testing before the 1.1.1 release.
> Ok, I think Richard had tested the userns support. Hi Richard, can you give me your ack or tested-by?

I'm still facing one userns related issue.

Create a container like this one:
---cut---
<domain type='lxc'>
  <name>testi</name>
  <memory>102400</memory>
  <os>
    <type>exe</type>
    <init>/bin/bash</init>
  </os>
  <idmap>
    <uid start='0' target='100000' count='100000'/>
    <gid start='0' target='100000' count='100000'/>
  </idmap>
  <devices>
    <console type='pty'/>
    <filesystem type='mount'>
      <source dir='/some/where/rootfs'/>
      <target dir='/'/>
    </filesystem>
    <interface type='network'>
      <source network='default'/>
      <mac address="52:54:00:be:49:be"/>
    </interface>
  </devices>
</domain>
---cut---

After creating it, attach to its console; you'll find bash as pid 1.
And you'll find that /proc/1/ is not fully uid/gid-mapped:
---cut---
# ls -la /proc/1/
total 0
dr-xr-xr-x  8 root   root    0 Jul  1 06:06 .
dr-xr-xr-x 74 nobody nogroup 0 Jul  1 06:06 ..
dr-xr-xr-x  2 root   root    0 Jul  1 06:06 attr
-r--------  1 nobody nogroup 0 Jul  1 06:06 auxv
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 cgroup
--w-------  1 nobody nogroup 0 Jul  1 06:06 clear_refs
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 cmdline
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 comm
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 coredump_filter
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 cpuset
lrwxrwxrwx  1 nobody nogroup 0 Jul  1 06:06 cwd -> /
-r--------  1 nobody nogroup 0 Jul  1 06:06 environ
lrwxrwxrwx  1 nobody nogroup 0 Jul  1 06:06 exe -> /bin/bash
dr-x------  2 nobody nogroup 0 Jul  1 06:06 fd
dr-x------  2 nobody nogroup 0 Jul  1 06:06 fdinfo
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 gid_map
-r--------  1 nobody nogroup 0 Jul  1 06:06 io
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 limits
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 loginuid
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 maps
-rw-------  1 nobody nogroup 0 Jul  1 06:06 mem
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 mountinfo
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 mounts
-r--------  1 nobody nogroup 0 Jul  1 06:06 mountstats
dr-xr-xr-x 10 root   root    0 Jul  1 06:06 net
dr-x--x--x  2 nobody nogroup 0 Jul  1 06:06 ns
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 numa_maps
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 oom_adj
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 oom_score
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 oom_score_adj
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 pagemap
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 personality
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 projid_map
lrwxrwxrwx  1 nobody nogroup 0 Jul  1 06:06 root -> /
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 schedstat
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 sessionid
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 smaps
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 stack
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 stat
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 statm
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 status
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 syscall
dr-xr-xr-x  3 root   root    0 Jul  1 06:06 task
-rw-r--r--  1 nobody nogroup 0 Jul  1 06:06 uid_map
-r--r--r--  1 nobody nogroup 0 Jul  1 06:06 wchan
---cut---

Systemd suffers from this issue because it needs to read from /proc/1/environ.

After one exec /proc seems to be fixed:
---cut---
# cat /proc/1/environ
cat: /proc/1/environ: Permission denied
# exec /bin/bash
# cat /proc/1/environ
TERM=linuxPATH=/bin:/sbinPWD=/container_uuid=fabc42f8-cdee-461c-9a21-93902ab52b40SHLVL=0LIBVIRT_LXC_UUID=fabc42f8-cdee-461c-9a21-93902ab52b40LIBVIRT_LXC_NAME=testicontainer=lxc-libvirt
---cut---

If I turn lxcContainerDropCapabilities() into a NOP the permissions in /proc are no longer clobbered.

Another (maybe related) issue: no capabilities seem to get dropped.
(Of course tested where lxcContainerDropCapabilities() is not a NOP :) )
---cut---
# /usr/bin/pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              full
---cut---

Any ideas what's going on here?

Thanks,
//richard

On Mon, Jul 01, 2013 at 08:29:14AM +0200, Richard Weinberger wrote:
> On 01.07.2013 04:26, Gao feng wrote:
>>> Well, given that we're at rc2 now & I'm still unclear about how some aspects of the userns setup are working, I'm afraid we'll have to wait until 1.1.1 for the userns LXC code to merge. I'll aim to do it next week, so that we have plenty of time for further testing before the 1.1.1 release.
>> Ok, I think Richard had tested the userns support. Hi Richard, can you give me your ack or tested-by?
> I'm still facing one userns related issue.

[snip]

> After creating it, attach to its console; you'll find bash as pid 1.
> And you'll find that /proc/1/ is not fully uid/gid-mapped:
> ---cut---
> # ls -la /proc/1/
> total 0
> dr-xr-xr-x  8 root   root    0 Jul  1 06:06 .
> dr-xr-xr-x 74 nobody nogroup 0 Jul  1 06:06 ..
> dr-xr-xr-x  2 root   root    0 Jul  1 06:06 attr

[snip]

> Any ideas what's going on here?

No, it is very odd. It smells like a kernel issue to me. What version are you running?

I've also tried running the demo programs shown on the LWN.net article

  https://lwn.net/Articles/532593/

and they don't operate in the way described by the article - the demo programs continue to run as 'nfsnobody' even after the mappings are set up.

I'm just using the Fedora 3.9.4-303 kernel, rebuilt with userns enabled in KConfig. I'm wondering if there is still stuff missing in 3.9.x that prevents this from working properly, or if the kernel behaviour changed after those LWN articles were written.

Daniel

Am 01.07.2013 12:33, schrieb Daniel P. Berrange:
No, it is very odd. It smells like a kernel issue to me. What version are you running?
I see this issue on all kernels. Currently I'm using vanilla v3.9.x and v3.10.
I've also tried running the demo programs shown in the LWN.net article
https://lwn.net/Articles/532593/
and they don't operate in the way described by the article - the demo programs continue to run as 'nfsnobody' even after the mappings are set up.
I'm just using the Fedora 3.9.4-303 kernel, rebuilt with userns enabled in KConfig. I'm wondering if there is still stuff missing in 3.9.x that prevents this from working properly, or if the kernel behaviour changed after those LWN articles were written.
To me it looks like the capability system behaves oddly. The mappings in /proc are fine as long as I do not call capng_updatev(). Calling capng_updatev() with parameters that do not change the current cap set triggers the odd behavior too.
So we see two (related?) issues:
1. If we try updating the capabilities of pid 1, /proc/1/ has unmapped files until we exec().
2. Dropping capabilities does not work; we always gain a fresh and full capability set.
BTW: I'm sure the issues are not caused by Gao feng's userns patches.
Feel free to add:
Acked-by: Richard Weinberger <richard@nod.at>
Tested-by: Richard Weinberger <richard@nod.at>
Thanks, //richard

On Mon, Jul 01, 2013 at 01:05:23PM +0200, Richard Weinberger wrote:
To me it looks like the capability system behaves oddly. The mappings in /proc are fine as long as I do not call capng_updatev(). Calling capng_updatev() with parameters that do not change the current cap set triggers the odd behavior too.
So we see two (related?) issues: 1. If we try updating the capabilities of pid 1, /proc/1/ has unmapped files until we exec(). 2. Dropping capabilities does not work; we always gain a fresh and full capability set.
BTW: I'm sure the issues are not caused by Gao feng's userns patches.
Yeah, I've reproduced this problem with standalone code outside of libvirt. Take the attached code and run
# gcc -Wall -o userns_child_exec userns_child_exec.c
# ./userns_child_exec -U -M '0 1000 1' -G '0 1000 8' bash
Launching child init
# id
uid=0(root) gid=7(lp) groups=0(root),65534(nfsnobody) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# ls -al /proc/1/environ
-r--------. 1 nfsnobody nfsnobody 0 Jul 1 12:14 /proc/1/environ
# cat /proc/1/environ
cat: /proc/1/environ: Permission denied
and this demo program isn't attempting to touch capabilities at all.
Daniel
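The -M '0 1000 1' argument above uses the same three-field layout that the kernel accepts in /proc/<pid>/uid_map: first ID inside the namespace, first ID outside, and the length of the range. A minimal sketch of producing and writing such an entry follows; format_idmap() and write_idmap() are hypothetical helper names, not part of libvirt or of the demo program.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one ID-map entry as "<inside> <outside> <count>\n".
 * Returns the number of characters written, or -1 if buf is too small. */
int format_idmap(char *buf, size_t len,
                 unsigned inside, unsigned outside, unsigned count)
{
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: write an entry to /proc/<pid>/<file>, where file is
 * "uid_map" or "gid_map". This is done from outside the new namespace. */
int write_idmap(long pid, const char *file, const char *entry)
{
    char path[64];
    FILE *fp;
    int rc;

    snprintf(path, sizeof(path), "/proc/%ld/%s", pid, file);
    fp = fopen(path, "w");
    if (!fp)
        return -1;
    rc = (fputs(entry, fp) == EOF) ? -1 : 0;
    if (fclose(fp) == EOF)
        rc = -1;
    return rc;
}
```

With this layout, '0 1000 1' maps uid 0 inside the namespace to uid 1000 outside it, for a range of one ID.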

On 01.07.2013 13:22, Daniel P. Berrange wrote:
Yeah, I've reproduced this problem with standalone code outside of libvirt.
Take the attached code and run
-ENOATTACHMENT :-( Thanks, //richard

On Mon, Jul 01, 2013 at 01:25:28PM +0200, Richard Weinberger wrote:
-ENOATTACHMENT :-(
Now really attached. I think I might know what is happening now though. When you start a new namespace, you must mount a new instance of the 'proc' filesystem. We are not synchronizing this with the setup of the uid/gid mappings though, so we are racy. So I have a feeling we're creating the proc filesystem before the mappings are set up. I'm going to add some synchronization to see if it makes a difference in this respect. Daniel
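The kind of synchronization described here can be sketched with a pipe: the child blocks until the parent has finished its side of the setup, and only then continues. This is an illustration under stated assumptions, not libvirt's code. It uses fork() where libvirt would clone() with the namespace flags, and run_synchronized(), setup_ok() and setup_fail() are hypothetical names.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical callback: parent-side setup, e.g. writing the uid/gid maps. */
typedef int (*parent_setup_fn)(pid_t child);

/* Placeholder setups used for illustration only. */
int setup_ok(pid_t child)   { (void)child; return 0; }
int setup_fail(pid_t child) { (void)child; return -1; }

/* Returns 0 if the setup succeeded and the child saw the go-ahead, else -1. */
int run_synchronized(parent_setup_fn setup)
{
    int fds[2], status, ok;
    pid_t pid;

    if (pipe(fds) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        ssize_t n;
        close(fds[1]);
        /* Block until the parent sends one byte; EOF means setup failed. */
        do {
            n = read(fds[0], &c, 1);
        } while (n < 0 && errno == EINTR);
        /* Only at this point would the child mount its own /proc. */
        _exit(n == 1 ? 0 : 1);
    }
    close(fds[0]);
    ok = (setup(pid) == 0);
    if (ok && write(fds[1], "x", 1) != 1)
        ok = 0;
    close(fds[1]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (ok && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Closing the write end without sending a byte is what lets the child tell a failed setup apart from a successful one.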

On 01.07.2013 13:35, Daniel P. Berrange wrote:
I think I might know what is happening now though. When you start a new namespace, you must mount a new instance of the 'proc' filesystem. We are not synchronizing this with the setup of the uid/gid mappings though, so we are racy. So I have a feeling we're creating the proc filesystem before the mappings are set up. I'm going to add some synchronization to see if it makes a difference in this respect.
So you mount /proc and write the uid/gid mappings in parallel? Both have to be done on the host side. Why is this parallel? Thanks, //richard

On 01.07.2013 13:44, Richard Weinberger wrote:
So you mount /proc and write the uid/gid mappings in parallel? Both have to be done on the host side. Why is this parallel?
Forget this one... :D Thanks, //richard

On 07/01/2013 07:05 PM, Richard Weinberger wrote:
To me it looks like the capability system behaves oddly. The mappings in /proc are fine as long as I do not call capng_updatev(). Calling capng_updatev() with parameters that do not change the current cap set triggers the odd behavior too.
This issue occurs after we call setuid: the init task of the container becomes un-dumpable after setuid. I don't know why, but the kernel sets the owner of /proc/<pid>/* to the root user of the host when the task is un-dumpable.
So we see two (related?) issues: 1. If we try updating the capabilities of pid 1, /proc/1/ has unmapped files until we exec(). 2. Dropping capabilities does not work; we always gain a fresh and full capability set.
This problem disappeared after: 1. removing the capability dropping; 2. calling prctl(PR_SET_DUMPABLE, 1) after setuid/gid.
BTW: I'm sure the issues are not caused by Gao feng's userns patches.
I think this is more like a kernel bug. The owner of /proc/<pid>/* should be set to the root user of the container, not the host.
Feel free to add: Acked-by: Richard Weinberger <richard@nod.at> Tested-by: Richard Weinberger <richard@nod.at>
Thanks for your help! Gao
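The workaround mentioned above, calling prctl(PR_SET_DUMPABLE, 1) after setuid/gid, can be sketched as follows. PR_GET_DUMPABLE and PR_SET_DUMPABLE are real prctl(2) options; the wrapper names get_dumpable() and restore_dumpable() are hypothetical and exist only for this illustration.

```c
#include <sys/prctl.h>

/* Returns the current "dumpable" flag of the calling process
 * (1 = dumpable, 0 = not dumpable), or -1 on error. */
int get_dumpable(void)
{
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/* Mark the process dumpable again, e.g. right after setuid()/setgid()
 * changed the credentials and the kernel cleared the flag. */
int restore_dumpable(void)
{
    return prctl(PR_SET_DUMPABLE, 1, 0, 0, 0);
}
```

Checking get_dumpable() before and after the setuid()/setgid() calls in the container init is a quick way to confirm whether this is what changes the ownership of /proc/<pid>/*.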

On 07/01/2013 07:57 PM, Gao feng wrote:
I think this is more like a kernel bug. The owner of /proc/<pid>/* should be set to the root user of the container, not the host.
You can try the attached program; the owner of /proc/<pid of this program>/* is incorrect too. Hmm, it's better to fix this problem in the kernel; it's most likely a userns bug. Thanks

On 06/26/2013 07:01 PM, Daniel P. Berrange wrote:
On Wed, Jun 26, 2013 at 05:56:19PM +0800, Gao feng wrote:
On 06/26/2013 05:38 PM, Daniel P. Berrange wrote:
On Wed, Jun 26, 2013 at 10:26:10AM +0800, Gao feng wrote:
On 06/26/2013 04:39 AM, Daniel P. Berrange wrote:
On Thu, Jun 13, 2013 at 08:02:18PM +0200, Richard Weinberger wrote:
Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
This is a problem with the way we're initializing mounts in the user namespace.
This problem exists even when libvirt lxc doesn't support user namespaces.
Yes, and this is a problem that user namespace is intended to solve.
We need to ensure that the initial mounts setup by libvirt can't be changed by admin inside the container. Preventing the container admin from remounting or unmounting these mounts is key to security.
IIUC, the only way to ensure this is to start a new user namespace /after/ setting up all mounts.
Starting a new user namespace means the container will lose control of the mount namespace, so the container can't do mount operations either, though only a few filesystems can be mounted in a non-init user namespace anyway.
Merely being able to unmount is sufficient to exploit the host. Consider that the container was configured with the following mapping
/ -> /
/export/mycontainer/home -> /home
Now, if the container admin can umount /home, then they can now see the home directory contents of the host. At least this is likely to be information leakage, and if any of the host home directories have UIDs that overlap with those assigned to the container ID map, you have a potentially exploitable situation.
Hence we need to ensure that the container cannot unmount or remount anything set up by libvirt. AFAICT, this means that all the mounts libvirt does must be performed in a separate user namespace to the one in which the container will eventually run.
Libvirt mounts something for the container in one user namespace, and then libvirt calls unshare to create a new user namespace and start the init task of the container.
Yes, the users in the container can't mount/unmount/remount any of the filesystems, but they can call unshare to create a new mount namespace, and they will have the right to mount/unmount/remount in this newly created mount namespace. They can still umount /home to see the home directory contents of the host.
An existing filesystem mount can only be remounted/unmounted by the (user ID, usernamespace) that originally mounted it. So even if you start a new mount namespace, you cannot unmount stuff setup by the parent user namespace.
Yes, we cannot unmount it, but the mount information will be inherited by the child mount namespace, and the newly created userns has the right to operate on the child mount namespace. That's what we allow.
# unshare --mount --user /bin/sh
sh-4.2$ umount /sys/kernel/debug
umount: /sys/kernel/debug: Invalid argument
Regards, Daniel

On 06/14/2013 02:02 AM, Richard Weinberger wrote:
Within a user namespace root can remount these filesystems at any time rw. Create these mappings only if we're not playing with user namespaces.
Without a user namespace, the root user of the container can remount all of the filesystems too, since he is the root user of the host. The reason we can allow filesystems to be mounted writable is that with a user namespace we can make sure the root user in the container has no right to change sysfs/sysctl configuration that we don't want him to change.
Signed-off-by: Richard Weinberger <richard@nod.at>
---
 src/lxc/lxc_container.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 4f00420..a003ec8 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -682,8 +682,17 @@ err:
     return ret;
 }

+static int userns_supported(void)
+{
+    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
+}

-static int lxcContainerMountBasicFS(void)
+static int userns_required(virDomainDefPtr def)
+{
+    return def->idmap.uidmap && def->idmap.gidmap;
+}
+
+static int lxcContainerMountBasicFS(virDomainDefPtr vmDef)
 {
     const struct {
         const char *src;
@@ -691,6 +700,7 @@ static int lxcContainerMountBasicFS(void)
         const char *type;
         const char *opts;
         int mflags;
+        bool paranoia;
     } mnts[] = {
         /* When we want to make a bind mount readonly, for unknown reasons,
          * it is currently necessary to bind it once, and then remount the
@@ -698,14 +708,14 @@ static int lxcContainerMountBasicFS(void)
          * mount point in the main OS becomes readonly too which is not what
          * we want. Hence some things have two entries here.
          */
-        { "proc", "/proc", "proc", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV },
-        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND },
-        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY },
-        { "sysfs", "/sys", "sysfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV },
-        { "sysfs", "/sys", "sysfs", NULL, MS_BIND|MS_REMOUNT|MS_RDONLY },
+        { "proc", "/proc", "proc", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, false },
+        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND, true },
+        { "/proc/sys", "/proc/sys", NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, true },
+        { "sysfs", "/sys", "sysfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, false },
+        { "sysfs", "/sys", "sysfs", NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, true },
 #if WITH_SELINUX
-        { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV },
-        { SELINUX_MOUNT, SELINUX_MOUNT, NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY },
+        { SELINUX_MOUNT, SELINUX_MOUNT, "selinuxfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV, false },
+        { SELINUX_MOUNT, SELINUX_MOUNT, NULL, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, true },
 #endif
     };
     int i, rc = -1;
@@ -720,6 +730,10 @@ static int lxcContainerMountBasicFS(void)

         srcpath = mnts[i].src;

+        /* Skip ro overlay mounts if we build a userns as root can remount it rw at any time */
+        if (userns_required(vmDef) && mnts[i].paranoia)
+            continue;
+
         /* Skip if mount doesn't exist in source */
         if ((srcpath[0] == '/') && (access(srcpath, R_OK) < 0))
@@ -1780,7 +1794,7 @@ static int lxcContainerSetupPivotRoot(virDomainDefPtr vmDef,
         goto cleanup;

     /* Mounts the core /proc, /sys, etc filesystems */
-    if (lxcContainerMountBasicFS() < 0)
+    if (lxcContainerMountBasicFS(vmDef) < 0)
         goto cleanup;

     /* Mounts /proc/meminfo etc sysinfo */
@@ -1896,16 +1910,6 @@ static int lxcContainerDropCapabilities(bool keepReboot ATTRIBUTE_UNUSED)
     return 0;
 }

-static int userns_supported(void)
-{
-    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
-}
-
-static int userns_required(virDomainDefPtr def)
-{
-    return def->idmap.uidmap && def->idmap.gidmap;
-}
-
 /**
  * lxcContainerChild:
  * @data: pointer to container arguments
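The effect of the new paranoia flag in the patch above can be illustrated in isolation. The sketch below is a simplified stand-in, not the libvirt code: struct basic_mount and count_mounts() are hypothetical, the table omits the type, options and flags columns, and it leaves out the SELinux entries.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the mnts[] table in the patch. */
struct basic_mount {
    const char *src;
    const char *dst;
    bool paranoia;   /* read-only overlay entry, skipped when a userns is used */
};

static const struct basic_mount mnts[] = {
    { "proc",      "/proc",     false },
    { "/proc/sys", "/proc/sys", true  },  /* bind mount */
    { "/proc/sys", "/proc/sys", true  },  /* read-only remount of the bind */
    { "sysfs",     "/sys",      false },
    { "sysfs",     "/sys",      true  },  /* read-only remount */
};

/* Count how many table entries would be mounted for a given configuration. */
size_t count_mounts(bool userns)
{
    size_t i, n = 0;

    for (i = 0; i < sizeof(mnts) / sizeof(mnts[0]); i++) {
        if (userns && mnts[i].paranoia)
            continue;   /* same skip condition as in the patch */
        n++;
    }
    return n;
}
```

With a user namespace only the plain proc and sysfs mounts remain; without one, all five entries are processed as before.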

On 13.06.2013 20:02, Richard Weinberger wrote:
Dropping capabilities within a user namespace makes no sense because any uid 0 process will regain all caps upon execve().
Signed-off-by: Richard Weinberger <richard@nod.at>
BTW: This one also solves a funny systemd issue. systemd reads from /proc/1/environ to detect whether it runs within LXC or not. If we change the capability set (it does not matter which cap we drop), uid 0/pid 1 is no longer allowed to read from that file. I have to admit that I don't fully understand what kind of user namespace/capability horror is going on. (Currently reading kernel sources to find out.) But if pid 1 execve's anything else, it regains a fresh capability set and is allowed to read /proc/1/environ. This is why <init>/sbin/init</init> did not work for me. If I use a simple bash wrapper as init which execve's systemd, it works fine... Thanks, //richard

On 13.06.2013 20:02, Richard Weinberger wrote:
Dropping capabilities within a user namespace makes no sense because any uid 0 process will regain all caps upon execve().
Signed-off-by: Richard Weinberger <richard@nod.at>
---
 src/lxc/lxc_container.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/src/lxc/lxc_container.c b/src/lxc/lxc_container.c
index 958e20d..4f00420 100644
--- a/src/lxc/lxc_container.c
+++ b/src/lxc/lxc_container.c
@@ -1896,6 +1896,15 @@ static int lxcContainerDropCapabilities(bool keepReboot ATTRIBUTE_UNUSED)
     return 0;
 }

+static int userns_supported(void)
+{
+    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
+}
+
+static int userns_required(virDomainDefPtr def)
+{
+    return def->idmap.uidmap && def->idmap.gidmap;
+}

 /**
  * lxcContainerChild:
@@ -1992,7 +2001,7 @@ static int lxcContainerChild(void *data)
     }

     /* drop a set of root capabilities */
-    if (lxcContainerDropCapabilities(!!hasReboot) < 0)
+    if (!userns_required(vmDef) && lxcContainerDropCapabilities(!!hasReboot) < 0)
         goto cleanup;

     if (lxcContainerSendContinue(argv->handshakefd) < 0) {
@@ -2025,16 +2034,6 @@ cleanup:
     return ret;
 }

-static int userns_supported(void)
-{
-    return lxcContainerAvailable(LXC_CONTAINER_FEATURE_USER) == 0;
-}
-
-static int userns_required(virDomainDefPtr def)
-{
-    return def->idmap.uidmap && def->idmap.gidmap;
-}
-
 virArch lxcContainerGetAlt32bitArch(virArch arch)
 {
     /* Any Linux 64bit arch which has a 32bit
Any feedback on that one? Thanks, //richard

On Tue, Jun 25, 2013 at 09:47:13AM +0200, Richard Weinberger wrote:
Any feedback on that one?
I've been away on PTO for 2 weeks, so LXC review/merge got delayed. I'm looking to get the basic userns stuff merged first, for this release, then I'll look at followup patches to see what we need for this release vs next. Daniel

On Thu, Jun 13, 2013 at 08:02:17PM +0200, Richard Weinberger wrote:
Dropping capabilities within a user namespace makes no sense because any uid 0 process will regain all caps upon execve().
That is true, except for the fact that libvirt has removed the capabilities from the bounding set too. This prevents them being regained upon execve. Daniel

On 25.06.2013 22:36, Daniel P. Berrange wrote:
On Thu, Jun 13, 2013 at 08:02:17PM +0200, Richard Weinberger wrote:
Dropping capabilities within a user namespace makes no sense because any uid 0 process will regain all caps upon execve().
That is true, except for the fact that libvirt has removed the capabilities from the bounding set too. This prevents them being regained upon execve.
Are you sure that this also applies to user namespaces? Thanks, //richard

On Tue, Jun 25, 2013 at 11:52:58PM +0200, Richard Weinberger wrote:
Are you sure that this applies also for user namespaces?
The only thing that namespaces change is that when you clone() with CLONE_NEWUSER set, the child process is initialized with the full set of capabilities, regardless of what the parent had. Thereafter all the normal rules about manipulation of capabilities apply, including the bounding set. Daniel
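Whether a capability is still in the bounding set, and could therefore come back on execve() for a uid 0 process, can be checked directly. The sketch below uses the raw prctl(2) interface rather than the libcap-ng calls (capng_updatev()) mentioned earlier in the thread; the wrapper names are hypothetical.

```c
#include <sys/prctl.h>
#include <linux/capability.h>

/* Returns 1 if cap is in the calling thread's bounding set, 0 if it has
 * been dropped, or -1 on error (for example an invalid capability number). */
int cap_in_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_READ, (unsigned long)cap, 0, 0, 0);
}

/* Drops cap from the bounding set. The caller needs CAP_SETPCAP in its
 * user namespace, otherwise this fails with -1. */
int drop_from_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_DROP, (unsigned long)cap, 0, 0, 0);
}
```

Running cap_in_bounding_set(CAP_SYS_BOOT) in the container init before and after the exec is one way to test the claim for a given kernel.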
participants (3): Daniel P. Berrange, Gao feng, Richard Weinberger