[libvirt-users] Libvirt-LXC + systemd + user namespace

Hi there!

I am trying to turn on the user namespace by adding the following lines to the config:

    <idmap>
      <uid start='0' target='0' count='100000'/>
      <gid start='0' target='0' count='100000'/>
    </idmap>

As you can see, root in the container is mapped to root outside. I expected to see no difference after adding these lines, but unfortunately there are some (see details below).

Am I missing something, or is there a problem with the system, libvirt or the kernel?

Full libvirt config:

    <domain type='lxc'>
      <name>test_with_idmap</name>
      <memory>102400</memory>
      <os>
        <type>exe</type>
        <init>/usr/lib/systemd/systemd</init>
      </os>
      <on_poweroff>destroy</on_poweroff>
      <on_reboot>restart</on_reboot>
      <on_crash>destroy</on_crash>
      <idmap>
        <uid start='0' target='0' count='100000'/>
        <gid start='0' target='0' count='100000'/>
      </idmap>
      <devices>
        <console type='pty'/>
        <filesystem type='mount'>
          <source dir='/guest'/>
          <target dir='/'/>
        </filesystem>
      </devices>
    </domain>

    root:~> uname -a
    Linux localhost 3.10.19-01077-g4a19d28-dirty #5 SMP PREEMPT Mon Jan 13 12:56:09 CET 2014 armv7l GNU/Linux
    root:~> libvirtd --version
    libvirtd (libvirt) 1.2.1
    root:~> systemd --version
    systemd 204

After adding idmap to the config, systemd can't start many of its services, in particular:

    Failed to mount Debug File System.
    Failed to mount Configuration File System.
    Failed to mount FUSE Control File System.
    Failed to start udev Kernel Device Manager.
    Failed to start Remount Root and Kernel File Systems.
    Failed to start Journal Service.

systemctl status says:

    ExecMount=/bin/mount debugfs /sys/kernel/debug -t debugfs (code=exited, status=32)
    ExecMount=/bin/mount configfs /sys/kernel/config -t configfs (code=exited, status=32)
    ExecMount=/bin/mount fusectl /sys/fs/fuse/connections -t fusectl (code=exited, status=32)
    ExecStart=/usr/lib/systemd/systemd-udevd (code=exited,status=206/OOM_ADJUST)
    ExecStart=/usr/lib/systemd/systemd-remount-fs (code=exited,status=1/FAILURE)
    ExecStart=/usr/lib/systemd/systemd-journald (code=exited, status=218/CAPABILITIES)

Thanks!
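For readers unfamiliar with the idmap element, here is a minimal sketch of how a single entry translates IDs, assuming the usual reading of the attributes: start is the first ID inside the container, target is the first ID on the host, and count is the length of the range. The names IdMapEntry and to_host_id are made up for illustration and are not part of libvirt.

```python
from typing import NamedTuple, Optional


class IdMapEntry(NamedTuple):
    """One <uid> or <gid> element (hypothetical helper type).

    IDs start..start+count-1 inside the container correspond to
    target..target+count-1 on the host.
    """
    start: int
    target: int
    count: int


def to_host_id(entry: IdMapEntry, container_id: int) -> Optional[int]:
    """Return the host ID for container_id, or None if it is unmapped."""
    offset = container_id - entry.start
    if 0 <= offset < entry.count:
        return entry.target + offset
    return None


# The mapping from the config above: an identity map for IDs 0..99999.
identity = IdMapEntry(start=0, target=0, count=100000)
print(to_host_id(identity, 0))       # 0 (container root is host root)
print(to_host_id(identity, 99999))   # 99999
print(to_host_id(identity, 100000))  # None (outside the mapped range)
```

With this identity map every ID in the range keeps its value, which is why no difference in file ownership would be expected. The failures reported above therefore have to come from something other than the ID translation itself.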

On Tue, Jan 28, 2014 at 12:32:41PM +0100, Jan Olszak wrote:
> Hi there!
> I am trying to turn on the user namespace by adding the following lines to the config:
> <idmap>
> <uid start='0' target='0' count='100000'/>
> <gid start='0' target='0' count='100000'/>
> </idmap>
> As you can see, root in the container is mapped to root outside. I expected to see no difference after adding these lines, but unfortunately there are some (see details below).
> Am I missing something, or is there a problem with the system, libvirt or the kernel?

I've not had any chance to try LXC + user namespaces + systemd yet, but based on the list of things which fail, it seems like systemd might not be detecting that it is inside a container. It seems almost as if it still has the CAP_MKNOD capability and so is trying to start things it should not, like udev and various filesystems.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 28.01.2014 12:46, Daniel P. Berrange wrote:
> On Tue, Jan 28, 2014 at 12:32:41PM +0100, Jan Olszak wrote:
> > Hi there!
> > I am trying to turn on the user namespace by adding the following lines to the config:
> > <idmap>
> > <uid start='0' target='0' count='100000'/>
> > <gid start='0' target='0' count='100000'/>
> > </idmap>
> > As you can see, root in the container is mapped to root outside. I expected to see no difference after adding these lines, but unfortunately there are some (see details below).
> > Am I missing something, or is there a problem with the system, libvirt or the kernel?
>
> I've not had any chance to try LXC + user namespaces + systemd yet, but based on the list of things which fail, it seems like systemd might not be detecting that it is inside a container. It seems almost as if it still has the CAP_MKNOD capability and so is trying to start things it should not, like udev and various filesystems.
>
> Daniel

I was able to reduce the problem to a case that uses neither libvirt nor systemd.

I created a bash process inside a user namespace with the mapping root_inside<->root_outside, using the program from https://lwn.net/Articles/532593/ :

    ./userns_child_exec -U -M '0 0 1' -G '0 0 1' bash

This program simply calls clone() with the CLONE_NEWUSER flag and sets the proper uid_map and gid_map.

The test commands are as follows:

    mkdir /test
    mount debugfs /test -t debugfs

and strace shows:

    mount("debugfs", "/test", "debugfs", MS_MGC_VAL, NULL) = -1 EPERM (Operation not permitted)

Now the question is: is this a kernel bug or expected behavior, i.e. do we always have limited permissions inside a user namespace, even if uid=0 inside the container is mapped to uid=0 outside?

    # cat /proc/$$/uid_map
    0 0 1
    # cat /proc/$$/gid_map
    0 0 1
    # cat /proc/$$/status | grep Cap
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff
    CapBnd: 0000001fffffffff

--
Piotr Bartosiewicz
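The Cap* masks above can be decoded by hand, but a small sketch makes the point easier to check. It assumes the standard bit numbers from linux/capability.h (CAP_SYS_ADMIN is 21, CAP_MKNOD is 27). The helper names parse_cap_masks and has_cap are invented for this example.

```python
# Capability bit numbers from <linux/capability.h>; only the two
# capabilities mentioned in this thread are listed.
CAP_SYS_ADMIN = 21
CAP_MKNOD = 27

# The Cap* lines quoted above, as printed by /proc/$$/status.
STATUS_LINES = """\
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
"""


def parse_cap_masks(status_text: str) -> dict:
    """Map each Cap* field name to its integer bitmask."""
    masks = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name.startswith("Cap"):
            masks[name] = int(value.strip(), 16)
    return masks


def has_cap(mask: int, cap: int) -> bool:
    """True if capability number cap is set in mask."""
    return (mask >> cap) & 1 == 1


masks = parse_cap_masks(STATUS_LINES)
effective = masks["CapEff"]
print(bin(effective).count("1"))          # 37 (bits 0..36 are all set)
print(has_cap(effective, CAP_SYS_ADMIN))  # True
print(has_cap(effective, CAP_MKNOD))      # True
print(masks["CapInh"])                    # 0
```

So the effective set does include CAP_SYS_ADMIN, and the EPERM from mount cannot be explained by a missing bit in the process's own capability sets. The masks only describe capabilities held within the process's user namespace, which is where the question above comes from.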

On Wed, Jan 29, 2014 at 12:35:25PM +0100, Piotr Bartosiewicz wrote:
> I was able to reduce the problem to a case that uses neither libvirt nor systemd.
> I created a bash process inside a user namespace with the mapping root_inside<->root_outside, using the program from https://lwn.net/Articles/532593/ :
>     ./userns_child_exec -U -M '0 0 1' -G '0 0 1' bash
> This program simply calls clone() with the CLONE_NEWUSER flag and sets the proper uid_map and gid_map.
> The test commands are as follows:
>     mkdir /test
>     mount debugfs /test -t debugfs
> and strace shows:
>     mount("debugfs", "/test", "debugfs", MS_MGC_VAL, NULL) = -1 EPERM (Operation not permitted)
> Now the question is: is this a kernel bug or expected behavior, i.e. do we always have limited permissions inside a user namespace, even if uid=0 inside the container is mapped to uid=0 outside?

uid==0 inside the container will not have exactly the same permissions as uid==0 in the host. The reason is the way the kernel checks capabilities. When a syscall requires CAP_SYS_ADMIN, for example, the kernel will use either capable(CAP_SYS_ADMIN), which only succeeds in the host, or ns_capable(CAP_SYS_ADMIN), which is allowed to succeed in the container. Different filesystems have differing restrictions, but at this time the vast majority of filesystems require that capable(CAP_SYS_ADMIN) succeed, and thus you can only mount them in the host.

Daniel
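The difference between the two checks can be illustrated with a toy model. This is a deliberately simplified sketch and not the kernel's code: it assumes that capable(cap) behaves like ns_capable() against the initial user namespace, and it leaves out details such as the namespace-owner rule. The class and function names are chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNamespace:
    name: str
    parent: Optional["UserNamespace"] = None


@dataclass
class Task:
    user_ns: UserNamespace     # namespace the task's credentials belong to
    effective_caps: frozenset  # capabilities held within that namespace


INIT_USER_NS = UserNamespace("init")


def ns_capable(task: Task, target_ns: UserNamespace, cap: str) -> bool:
    """Simplified check: does task hold cap over target_ns?

    Walk from target_ns towards the root. The task qualifies only if its
    own namespace is target_ns or one of its ancestors, and the task
    holds the capability there.
    """
    ns = target_ns
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.effective_caps
        ns = ns.parent
    return False


def capable(task: Task, cap: str) -> bool:
    """Simplified capable(): the check is made against the host namespace."""
    return ns_capable(task, INIT_USER_NS, cap)


container_ns = UserNamespace("container", parent=INIT_USER_NS)
host_root = Task(INIT_USER_NS, frozenset({"CAP_SYS_ADMIN"}))
container_root = Task(container_ns, frozenset({"CAP_SYS_ADMIN"}))

print(capable(host_root, "CAP_SYS_ADMIN"))                       # True
print(capable(container_root, "CAP_SYS_ADMIN"))                  # False
print(ns_capable(container_root, container_ns, "CAP_SYS_ADMIN")) # True
```

In this model container_root has the same capability bits as host_root, yet any operation guarded by capable() fails for it. That matches the EPERM seen when mounting debugfs from inside the user namespace, even with uid 0 mapped to uid 0.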
participants (3)
- Daniel P. Berrange
- Jan Olszak
- Piotr Bartosiewicz