[libvirt] How does virsh lxc-enter-namespace work? Does it?

Hi!

I'm facing the issue that "virsh lxc-enter-namespace ..." does not work for me. setns() always fails with EINVAL. Reading the code confused me a bit; maybe you can help me. :D

virsh itself calls:
  cmdLxcEnterNamespace()
  virDomainLxcOpenNamespace()
  conn->driver->domainLxcOpenNamespace()

Here comes the first thing that is not clear to me: conn->driver seems to be the remote driver, and therefore ->domainLxcOpenNamespace is remoteDomainLxcOpenNamespace(). Why is lxc:/// a remote connection?

remoteDomainLxcOpenNamespace() does an RPC call to libvirtd. On the remote side libvirtd runs lxcDispatchDomainOpenNamespace(), which opens the namespace fds and sends them back as the result. How can this work? Does it do some magic file descriptor passing over AF_UNIX somewhere? virsh then receives the fds (pure numbers) and setns() fails badly.

Wouldn't it make much more sense to do the open(/proc/XXX/ns/{mnt, user, ...}) and setns() calls directly on the local side, in other words directly in virsh? driver->domainLxcOpenNamespace() should only report the process ID of the container's init process.

Thanks,
//richard

On Thu, Jun 06, 2013 at 08:57:21AM +0200, Richard Weinberger wrote:
> On the remote side libvirtd does:
> lxcDispatchDomainOpenNamespace(), which opens the namespace fds, and sends them back as result. How can this work? Does it do some magic file descriptor passing over AF_UNIX somewhere?

Yes, we use SCM_RIGHTS to pass FDs.

> virsh then receives the fds (pure numbers) and setns() fails badly.
> Wouldn't it make much more sense to do the open(/proc/XXX/ns/{mnt, user, ...}) and setns() calls directly on the local side? IOW directly in virsh? driver->domainLxcOpenNamespace() should only report the process id of the container's init process.

The reason for doing it server side is to get privilege separation, e.g. libvirtd runs privileged to open the fds, and virsh can run unprivileged with setns(). Unfortunately it seems the kernel doesn't allow the thing calling setns() to be unprivileged at this time, but the design allows for this enhancement in the future.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
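For readers unfamiliar with the SCM_RIGHTS mechanism mentioned above, here is a minimal sketch of passing one open file descriptor across an AF_UNIX socket. It is not libvirt's RPC code; the helper names send_fd() and recv_fd() are hypothetical and chosen only for illustration. The point it shows is that the receiver does not get a "pure number": the kernel installs a new descriptor in the receiving process that refers to the same open file.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one open fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error (errno set by sendmsg). */
int send_fd(int sock, int fd)
{
    char dummy = 'F';   /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;               /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;           /* payload is an array of fds */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new local descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    assert(sock >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In the privilege-separation design described above, the privileged side would call something like send_fd() with a descriptor opened on a /proc/$PID/ns/ file, and the client would pass the result of recv_fd() to setns().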

On 06.06.2013 09:56, Daniel P. Berrange wrote:
> The reason for doing it server side is to get privilege separation. eg libvirtd runs privileged to open the fds, and virsh can run unprivileged with setns(). Unfortunately it seems the kernel doesn't allow for the thing calling setns() to be unprivileged at this time, but the design allows for this enhancement in the future.

setns() needs CAP_SYS_ADMIN, and the manpage also says:

  ERRORS
    ...
    EINVAL fd refers to a namespace whose type does not match that specified in nstype, or there is problem with reassociating the thread with the specified namespace.

I'm sure in my case setns() fails because the calling thread did not open() the ns files itself.

What is the plan to make lxc-enter-namespace work? Privilege separation is nice, but as of now the kernel interface (setns()) seems not to allow this. Are you forcing the kernel guys to change the interface? In the meantime I'll use util-linux's nsenter, which works fine.

Thanks,
//richard
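The "local" approach raised earlier in the thread (open the /proc/<pid>/ns/ files and call setns() in the same process) can be sketched roughly as follows. This is only an illustration of the idea, not the code of virsh or nsenter; ns_path(), open_ns() and enter_ns() are hypothetical names, and the caller is assumed to have the privileges setns() demands.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write "/proc/<pid>/ns/<name>" into buf.
 * Returns the string length, or -1 if it does not fit. */
int ns_path(char *buf, size_t len, pid_t pid, const char *name)
{
    int n;

    assert(buf != NULL && name != NULL);
    n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, name);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Hypothetical helper: open one namespace file of a process.
 * Returns an fd, or -1 on error. */
int open_ns(pid_t pid, const char *name)
{
    char path[64];

    if (ns_path(path, sizeof(path), pid, name) < 0)
        return -1;
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Hypothetical helper: join one namespace (e.g. "mnt", "net") of pid.
 * Returns 0 on success, -1 on error with errno preserved. */
int enter_ns(pid_t pid, const char *name)
{
    int fd = open_ns(pid, name);
    int rc, saved;

    if (fd < 0)
        return -1;
    rc = setns(fd, 0);      /* nstype 0: accept whatever type fd refers to */
    saved = errno;
    close(fd);
    errno = saved;          /* close() must not hide the setns() error */
    return rc;
}
```

A caller would pass the PID of the container's init process, for example enter_ns(init_pid, "mnt"), where init_pid is whatever the driver reports.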

On Thu, Jun 06, 2013 at 10:07:26AM +0200, Richard Weinberger wrote:
> setns() needs CAP_SYS_ADMIN, and the manpage also says:

The hope is that this can be relaxed: it ought to be sufficient to just restrict access to the /proc/$PID/ns/ files to enforce permissions, or to require CAP_SYS_ADMIN only when opening the files. I can't see any compelling reason why you should require CAP_SYS_ADMIN on setns() itself once you have the FDs open.

> ERRORS: ... EINVAL fd refers to a namespace whose type does not match that specified in nstype, or there is problem with reassociating the thread with the specified namespace.
> I'm sure in my case setns() fails because the calling thread did not open() the ns files itself.

Do you have user namespaces enabled by chance?

> What is the plan to make lxc-enter-namespace work? Privilege separation is nice but as of now the kernel interface (setns()) seems not to allow this. Are you forcing the kernel guys to change the interface?

It has long worked fine on Fedora, though we do not have user namespaces enabled, since parts of the kernel are yet to be ported to that (XFS in particular). My best guess is that user namespaces may have caused a regression in this ability to call setns() from a separate process.

Daniel

On 06.06.2013 10:13, Daniel P. Berrange wrote:
> Do you have user namespaces enabled by chance?

Yeah, but only within the kernel. Maybe this is why setns() is failing for me.

>> What is the plan to make lxc-enter-namespace work? Privilege separation is nice but as of now the kernel interface (setns()) seems not to allow this. Are you forcing the kernel guys to change the interface?
> It has long worked fine on Fedora, though we do not have user namespaces enabled since parts of the kernel are yet to be ported to that (XFS in particular). My best guess is that user namespaces may have caused a regression in this ability to call setns() from a separate process.

Sounds sane. /me looks. :-)

Thanks,
//richard

On 06.06.2013 10:13, Daniel P. Berrange wrote:
> It has long worked fine on Fedora, though we do not have user namespaces enabled since parts of the kernel are yet to be ported to that (XFS in particular). My best guess is that user namespaces may have caused a regression in this ability to call setns() from a separate process.

I can confirm that lxc-enter-namespace works fine when I disable CONFIG_USER_NS in my kernel. Currently I'm moving my old LXC setup over to libvirt, and later I'll enable user namespaces too. Let's see what else breaks. ;-) Stay tuned!

Thanks,
//richard

On Thu, Jun 06, 2013 at 09:13:27AM +0100, Daniel P. Berrange wrote:
> On Thu, Jun 06, 2013 at 10:07:26AM +0200, Richard Weinberger wrote:
>> I'm sure in my case setns() fails because the calling thread did not open() the ns files itself.
> Do you have user namespaces enabled by chance?
>> What is the plan to make lxc-enter-namespace work? Privilege separation is nice but as of now the kernel interface (setns()) seems not to allow this. Are you forcing the kernel guys to change the interface?
> It has long worked fine on Fedora, though we do not have user namespaces enabled since parts of the kernel are yet to be ported to that (XFS in particular). My best guess is that user namespaces may have caused a regression in this ability to call setns() from a separate process.

The problem is actually that you're not allowed to call setns(fd) for an fd which refers to your current namespace. The fd must refer to a different namespace. Of course the code is opening the '/proc/$PID/ns/user' file even though libvirt doesn't give the container a new user namespace. The simplest fix is to just ignore EINVAL from setns(), since we can't easily figure out whether the calling app's namespace matches the namespace of the container.

Daniel
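The "ignore EINVAL" idea can be sketched as below. This is an illustration under stated assumptions, not the patch that went into libvirt: setns_error_is_ignorable() and enter_all() are hypothetical names, and the fds are assumed to be already-open /proc/$PID/ns/ descriptors. Note the trade-off: EINVAL can also signal other setns() problems, so skipping it unconditionally may hide a genuine failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical policy helper: may a setns() failure with this errno be skipped?
 * EINVAL is treated as "probably already in that namespace". */
int setns_error_is_ignorable(int err)
{
    return err == EINVAL;
}

/* Hypothetical helper: try to join every namespace in fds[0..nfds-1].
 * Returns 0 if each setns() either succeeded or failed with an
 * ignorable error, -1 on the first other error (errno left as set). */
int enter_all(const int *fds, size_t nfds)
{
    size_t i;

    assert(fds != NULL || nfds == 0);
    for (i = 0; i < nfds; i++) {
        if (setns(fds[i], 0) < 0 && !setns_error_is_ignorable(errno))
            return -1;
    }
    return 0;
}
```

Keeping the policy in its own small function makes it easy to tighten later, for instance once there is a reliable way to compare the caller's namespace with the container's.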

On 07.06.2013 17:34, Daniel P. Berrange wrote:
> The problem is actually that you're not allowed to call setns(fd) for a fd which refers to your current namespace. The fd must refer to a different namespace. Of course the code is opening the '/proc/$PID/ns/user' file even though libvirt doesn't give the container a new user namespace. The simplest fix is to just ignore EINVAL from setns(), since we can't easily figure out if the calling apps' namespace matches the namespace of the container.

Thanks a ton for figuring that out!

Thanks,
//richard

participants (2)
- Daniel P. Berrange
- Richard Weinberger