
On Wed, Nov 13, 2013 at 02:53:05PM +0000, Daniel P. Berrange wrote:
On Fri, Nov 08, 2013 at 02:42:26PM -0500, Rich Felker wrote:
On Fri, Nov 08, 2013 at 01:30:09PM +0800, Daniel P. Berrange wrote:
On Thu, Nov 07, 2013 at 09:15:43PM +0800, Gao feng wrote:
I ran into a problem where the container blocks in seteuid/setegid, which are called from lxcContainerSetID, on a UP system with libvirt compiled with --with-fuse=yes.
I looked into the glibc code and found that setxid in glibc calls futex() to wait for other threads to change their setxid_futex to 0 (see setxid_mark_thread in glibc).
The process created by the clone system call does not share memory with the other threads, and its view of memory does not change until we call execl (COW).
So if the process created by clone runs before the fuse thread has started, the new setxid_futex value of the fuse thread will never be seen in this process, and it will block forever.
Maybe this problem should be fixed in glibc, but I am sending this patch as a quick fix.
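The copy-on-write point above can be illustrated with a small sketch. This is not glibc's code: `flag` is a hypothetical stand-in for a per-thread value such as setxid_futex, `child_view_of_flag` is an invented helper name, and fork() is used in place of clone() on the assumption that, without CLONE_VM, both give the child a private copy of memory.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a flag another thread is expected to clear. */
static volatile int flag = 1;

/*
 * Fork a child, then clear the flag in the parent only.
 * The child waits until the parent has written, then reports the value
 * it sees through its exit status.
 * Returns the child's view of the flag, or -1 on error.
 */
int child_view_of_flag(void)
{
    int to_child[2];
    int status;
    ssize_t n;
    pid_t pid;

    if (pipe(to_child) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: block until the parent signals that it has updated flag. */
        char c;
        close(to_child[1]);
        if (read(to_child[0], &c, 1) != 1)
            _exit(255);
        /* The parent's store went to the parent's private page (COW). */
        _exit(flag);
    }

    /* Parent: update the flag after the child already exists. */
    close(to_child[0]);
    flag = 0;
    n = write(to_child[1], "x", 1);
    close(to_child[1]);

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    flag = 1; /* restore the initial value for repeated calls */

    if (n != 1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling child_view_of_flag() is expected to return 1: the child keeps seeing the value from the moment of the fork, which is why a wait on such a value in the child can never be satisfied by a thread that exists only in the parent.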
Can you show a stack trace of the threads/processes deadlocking?
I think this is a symptom of setxid not being async-signal-safe like it's required to be. I'm not sure if we have a bug tracker entry for that; if not, it should be added. But if clone() is being used except in a fork-like manner, this is probably invalid application usage too.
We are not using clone() in a manner that is strictly equivalent to fork(). Libvirt is using clone() to create Linux containers with new namespaces, e.g. we do:
clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWNET|SIGCHLD)
Understood. I still call this a fork-like manner since it's not sharing VM or using CLONE_THREAD and using the default signal of SIGCHLD. BTW is there a reason to prefer this usage over regular fork followed by unshare()?
IIUC, if a process is multi-threaded you should restrict yourself to use of async signal safe functions in between fork() and exec(). I assume this restriction applies to clone() and exec() pairings too.
Libvirt is in fact violating rules about only using async signal safe functions between clone() and exec() in many places. So I think what we need to do is avoid starting any threads in the parent until after we've clone()'d to create the new child namespace.
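The restriction mentioned above, calling only async-signal-safe functions in the child between fork() and exec(), can be sketched roughly as follows. This is a minimal illustration and not libvirt code: `run_shell` is a hypothetical helper, and it assumes /bin/sh exists on the system.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run "/bin/sh -c <cmd>" and return its exit status,
 * or -1 on failure. In the child, only async-signal-safe functions are
 * called between fork() and execv(): execv(), write() and _exit().
 */
int run_shell(const char *cmd)
{
    /* Everything the child needs is prepared before fork(),
     * so the child does not allocate memory or format strings. */
    char *const argv[] = { "sh", "-c", (char *)cmd, NULL };
    static const char msg[] = "exec failed\n";
    int status;
    pid_t pid;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        execv("/bin/sh", argv);
        /* Reached only if execv() failed. No printf(), no exit():
         * neither is async-signal-safe. */
        if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
            /* nothing safe to do about a failed write */
        }
        _exit(127);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_shell("exit 7") should return 7. Calls such as malloc(), printf() or, with glibc as discussed in this thread, setuid()/seteuid() are the kind of thing that would not belong in the child branch.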
Per the specification, setuid is AS-safe. However glibc fails to meet this requirement (it's actually very hard to meet due to Linux limitations in how the kernel manages uids/gids). So for now, avoiding starting threads until after performing clone() is probably a better solution than trying to eliminate calls to non-AS-safe functions.

Rich