[libvirt] [PATCH] LXC: make sure fuse thread start to run before we do clone

I hit a problem where the container blocks in seteuid/setegid, which are called from lxcContainerSetID, on a UP system with libvirt compiled with --with-fuse=yes.

I looked into the glibc code and found that setxid in glibc calls futex() to wait for other threads to change their setxid_futex to 0 (see setxid_mark_thread in glibc). The process created by the clone system call does not share memory with the other threads, and its copy of that memory does not change until we call execl (COW).

So if the process created by clone runs before the fuse thread has started, the new setxid_futex value of the fuse thread will never be seen in this process, and it will block forever.

Maybe this problem should be fixed in glibc, but I am sending this patch as a quick fix.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
---
 src/lxc/lxc_controller.c | 13 ++++++++++++-
 src/lxc/lxc_fuse.c       |  6 +++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/lxc/lxc_controller.c b/src/lxc/lxc_controller.c
index c8f68c0..ed83bb3 100644
--- a/src/lxc/lxc_controller.c
+++ b/src/lxc/lxc_controller.c
@@ -1977,7 +1977,18 @@ cleanup:
 static int
 virLXCControllerSetupFuse(virLXCControllerPtr ctrl)
 {
-    return lxcSetupFuse(&ctrl->fuse, ctrl->def);
+    int ret = lxcSetupFuse(&ctrl->fuse, ctrl->def);
+
+    if (!ret) {
+        /* Wait for fuse thread starting run, so we
+         * can make sure the setxid_futex of fuse thread
+         * is 0(see start_thread of glibc), otherwise
+         * the lxcContainerChild will block at setxid. */
+        virMutexLock(&ctrl->fuse->lock);
+        virMutexUnlock(&ctrl->fuse->lock);
+    }
+
+    return ret;
 }
 
 static int
diff --git a/src/lxc/lxc_fuse.c b/src/lxc/lxc_fuse.c
index 9d12832..8cddfa8 100644
--- a/src/lxc/lxc_fuse.c
+++ b/src/lxc/lxc_fuse.c
@@ -272,6 +272,8 @@ static void lxcFuseDestroy(virLXCFusePtr fuse)
 static void lxcFuseRun(void *opaque)
 {
     virLXCFusePtr fuse = opaque;
+    /* Let libvirt_lxc continue. */
+    virMutexUnlock(&fuse->lock);
 
     if (fuse_loop(fuse->fuse) < 0)
         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
@@ -321,7 +323,9 @@ int lxcSetupFuse(virLXCFusePtr *f, virDomainDefPtr def)
         fuse_unmount(fuse->mountpoint, fuse->ch);
         goto cleanup1;
     }
-
+    /* Get mutex lock, lxcFuseRun will unlock it. this will
+     * cause libvirt_lxc wait for the fuse thread starting. */
+    virMutexLock(&fuse->lock);
     if (virThreadCreate(&fuse->thread, false, lxcFuseRun,
                         (void *)fuse) < 0) {
         lxcFuseDestroy(fuse);
--
1.8.3.1
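
The patch makes the creator wait until the new thread has begun executing its start routine. A minimal sketch of that same handshake follows, assuming a POSIX semaphore instead of the lock/unlock pair used in the patch (POSIX leaves the result of unlocking a default mutex from a thread that did not lock it undefined). The names struct worker, worker_main and worker_start are hypothetical and are not libvirt APIs.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical stand-in for the FUSE state; not libvirt's virLXCFuse. */
struct worker {
    pthread_t thread;
    sem_t started;   /* posted once by the worker when it begins running */
    int running;     /* written by the worker before it posts */
};

static void *worker_main(void *opaque)
{
    struct worker *w = opaque;

    w->running = 1;
    sem_post(&w->started);   /* let the creator continue */
    /* The real service loop (fuse_loop() in the patch) would run here. */
    return NULL;
}

/* Start the worker and return only after it has begun executing.
 * Returns 0 on success, -1 on failure. */
static int worker_start(struct worker *w)
{
    w->running = 0;
    if (sem_init(&w->started, 0, 0) < 0)
        return -1;

    if (pthread_create(&w->thread, NULL, worker_main, w) != 0) {
        sem_destroy(&w->started);
        return -1;
    }

    /* Block until the worker posts; retry if a signal interrupts us. */
    while (sem_wait(&w->started) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

Once worker_start() returns 0, the worker is known to be inside worker_main, which is the property the patch relies on before the controller goes on to clone the container process.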

On Thu, Nov 07, 2013 at 09:15:43PM +0800, Gao feng wrote:
> I hit a problem where the container blocks in seteuid/setegid, which are called from lxcContainerSetID, on a UP system with libvirt compiled with --with-fuse=yes.
>
> I looked into the glibc code and found that setxid in glibc calls futex() to wait for other threads to change their setxid_futex to 0 (see setxid_mark_thread in glibc). The process created by the clone system call does not share memory with the other threads, and its copy of that memory does not change until we call execl (COW).
>
> So if the process created by clone runs before the fuse thread has started, the new setxid_futex value of the fuse thread will never be seen in this process, and it will block forever.
>
> Maybe this problem should be fixed in glibc, but I am sending this patch as a quick fix.

Can you show a stack trace of the threads/processes deadlocking?

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, Nov 08, 2013 at 01:30:09PM +0800, Daniel P. Berrange wrote:
> Can you show a stack trace of the threads/processes deadlocking?

I think this is a symptom of setxid not being async-signal-safe like it's required to be. I'm not sure if we have a bug tracker entry for that; if not, it should be added. But if clone() is being used except in a fork-like manner, this is probably invalid application usage too.

Rich
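
The mechanism described in the original report can be reproduced in miniature. The sketch below assumes plain fork() as a stand-in for the fork-like clone() under discussion, and a hypothetical global named flag in place of glibc's per-thread setxid_futex: a value the parent changes after the child was created never becomes visible in the child's copy-on-write snapshot.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a per-thread flag such as setxid_futex. */
static volatile int flag = -2;

/* Returns 1 if the child still saw the old value after the parent
 * changed it, 0 if it saw the new value, -1 on error. */
static int child_sees_stale_flag(void)
{
    int go[2];
    int status;
    ssize_t n;
    pid_t pid;

    flag = -2;
    if (pipe(go) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        return -1;
    }

    if (pid == 0) {
        char c;

        close(go[1]);
        /* Wait until the parent has updated its own copy of flag. */
        if (read(go[0], &c, 1) != 1)
            _exit(2);
        _exit(flag == -2 ? 1 : 0);
    }

    close(go[0]);
    flag = 0;                       /* changes only the parent's copy */
    n = write(go[1], "x", 1);       /* tell the child to look now */
    close(go[1]);

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) || n != 1)
        return -1;
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}
```

The child only calls read() and _exit(), both of which are async-signal-safe, so the demonstration itself stays within the rule discussed in this thread.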

On 11/09/2013 03:42 AM, Rich Felker wrote:
> I think this is a symptom of setxid not being async-signal-safe like it's required to be. I'm not sure if we have a bug tracker entry for that; if not, it should be added. But if clone() is being used except in a fork-like manner, this is probably invalid application usage too.

I posted a patch to the glibc community, but I can't find it in the mailing list archive, so the patch is attached. Do you think this glibc patch is needed, or should we just add a bug tracker entry about the man page?

On Fri, Nov 08, 2013 at 02:42:26PM -0500, Rich Felker wrote:
> I think this is a symptom of setxid not being async-signal-safe like it's required to be. I'm not sure if we have a bug tracker entry for that; if not, it should be added. But if clone() is being used except in a fork-like manner, this is probably invalid application usage too.

We are not using clone() in a manner that is strictly equivalent to fork(). Libvirt is using clone() to create Linux containers with new namespaces. eg we do

  clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWNET|SIGCHLD)

IIUC, if a process is multi-threaded you should restrict yourself to use of async signal safe functions in between fork() and exec(). I assume this restriction applies to clone() and exec() pairings too.

Libvirt is in fact violating rules about only using async signal safe functions between clone() and exec() in many places. So I think what we need to do is avoid starting any threads in the parent until after we've clone()'d to create the new child namespace.

Regards,
Daniel
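
The ordering proposed here, clone first and start threads afterwards, might look roughly like the sketch below. It is an assumption-laden illustration rather than libvirt code: it passes only SIGCHLD to clone() so that it runs without privileges (the CLONE_NEW* namespace flags quoted above would be OR-ed in by a privileged caller), and child_main, helper_main and clone_then_start_thread are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Runs in the cloned child; its return value becomes the exit status.
 * Container setup and exec would happen here in a real controller. */
static int child_main(void *arg)
{
    (void)arg;
    return 7;
}

/* Hypothetical helper thread, standing in for the FUSE thread. */
static void *helper_main(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Clone while the process is still single-threaded, and only then start
 * the helper thread. Returns the child's exit status, or -1 on error. */
static int clone_then_start_thread(int *thread_ran)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    pthread_t helper;
    int status;
    int ret = -1;
    pid_t pid;

    if (!stack)
        return -1;

    /* The stack grows downwards on common architectures, so pass its top. */
    pid = clone(child_main, stack + CHILD_STACK_SIZE, SIGCHLD, NULL);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    /* No other thread existed at clone() time, so the child's snapshot
     * contains no half-started thread state. */
    if (pthread_create(&helper, NULL, helper_main, thread_ran) == 0)
        pthread_join(helper, NULL);

    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        ret = WEXITSTATUS(status);

    free(stack);
    return ret;
}
```

Because the child is created before any thread exists, functions that walk the thread list in the child have nothing stale to wait on.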

On Wed, Nov 13, 2013 at 02:53:05PM +0000, Daniel P. Berrange wrote:
> We are not using clone() in a manner that is strictly equivalent to fork(). Libvirt is using clone() to create Linux containers with new namespaces. eg we do
>
>   clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWNET|SIGCHLD)

Understood. I still call this a fork-like manner since it's not sharing VM or using CLONE_THREAD and using the default signal of SIGCHLD. BTW is there a reason to prefer this usage over regular fork followed by unshare()?

> IIUC, if a process is multi-threaded you should restrict yourself to use of async signal safe functions in between fork() and exec(). I assume this restriction applies to clone() and exec() pairings too.
>
> Libvirt is in fact violating rules about only using async signal safe functions between clone() and exec() in many places. So I think what we need to do is avoid starting any threads in the parent until after we've clone()'d to create the new child namespace.

Per the specification, setuid is AS-safe. However glibc fails to meet this requirement (it's actually very hard to meet due to Linux limitations in how the kernel manages uids/gids). So for now, avoiding starting threads until after performing clone() is probably a better solution than trying to eliminate calls to non-AS-safe functions.

Rich

On 11/13/2013 11:16 AM, Rich Felker wrote:
>> We are not using clone() in a manner that is strictly equivalent to fork(). Libvirt is using clone() to create Linux containers with new namespaces. eg we do
>>
>>   clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWNET|SIGCHLD)
>
> Understood. I still call this a fork-like manner since it's not sharing VM or using CLONE_THREAD and using the default signal of SIGCHLD. BTW is there a reason to prefer this usage over regular fork followed by unshare()?

Yes. Per 'man 2 unshare', CLONE_NEWPID is not supported with unshare(), yet we require our child to have pid 1 in its new pid namespace.

--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On Wed, Nov 13, 2013 at 11:33:46AM -0700, Eric Blake wrote:
> Yes. Per 'man 2 unshare', CLONE_NEWPID is not supported with unshare(), yet we require our child to have pid 1 in its new pid namespace.

Yeah, I also wish we could use unshare() instead of clone(), but the CLONE_NEWPID design limitation is a blocker for that.

Daniel

On 11/13/2013 10:53 PM, Daniel P. Berrange wrote:
> We are not using clone() in a manner that is strictly equivalent to fork(). Libvirt is using clone() to create Linux containers with new namespaces. eg we do
>
>   clone(CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWNET|SIGCHLD)
>
> IIUC, if a process is multi-threaded you should restrict yourself to use of async signal safe functions in between fork() and exec(). I assume this restriction applies to clone() and exec() pairings too.
>
> Libvirt is in fact violating rules about only using async signal safe functions between clone() and exec() in many places. So I think what we need to do is avoid starting any threads in the parent until after we've clone()'d to create the new child namespace.

Thanks to fuse, any attempt to access files exported by fuse will block until the fuse thread starts running fuse_loop. I will post an update.

Thanks, guys.

On 11/08/2013 01:30 PM, Daniel P. Berrange wrote:
> Can you show a stack trace of the threads/processes deadlocking?

Sure.

The libvirt_lxc tasks:

root 7922 0.0 0.1 118976 3704 ? Ssl 20:55 0:00 /usr/local/libexec/libvirt_lxc --name chx3 --console 17 --security=selinux --handshake 20 --background --veth vnet1
root 7927 0.0 0.1 53440 3072 ? S 20:55 0:00 /usr/local/libexec/libvirt_lxc --name chx3 --console 17 --security=selinux --handshake 20 --background --veth vnet1

The pid of the fuse thread is 7925:

[root@localhost ~]# ls /proc/7922/task/
7922 7925

gdb -p 7925
(gdb) bt
#0 0x00007f2d39bcb83d in read () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f2d3a5dfb72 in fuse_kern_chan_receive () from /glibc/lib/libfuse.so.2
#2 0x00007f2d3a5e0b16 in fuse_ll_receive_buf () from /glibc/lib/libfuse.so.2
#3 0x00007f2d3a5dfdd1 in fuse_session_loop () from /glibc/lib/libfuse.so.2
#4 0x00007f2d3a5d8468 in fuse_loop () from /glibc/lib/libfuse.so.2
#5 0x00007f2d3aa55691 in lxcFuseRun (opaque=opaque@entry=0x7f2d3b13a420) at lxc/lxc_fuse.c:276
#6 0x00007f2d3aaebb8e in virThreadHelper (data=<optimized out>) at util/virthreadpthread.c:161
#7 0x00007f2d39bc4f22 in start_thread (arg=0x7f2d37fbc700) at pthread_create.c:309
#8 0x00007f2d392ca6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The arg of start_thread is the struct pthread of the fuse thread. You can see the setxid_futex of the fuse pthread has been set to 0.

(gdb) p *(struct pthread*)0x7f2d37fbc700 $1 = {{header = {tcb = 0x7f2d37fbc700, dtv = 0x7f2d3b2c9ae0, self = 0x7f2d37fbc700, multiple_threads = 1, gscope_flag = 0, sysinfo = 0, stack_guard = 5516672127090939392, pointer_guard = 9991483700321457629, vgetcpu_cache = {0, 0}, __unused1 = 0, rtld_must_xmm_save = 0, __private_tm = {0x0, 0x0, 0x0, 0x0}, __private_ss = 0x0, __unused2 = 0, rtld_savespace_sse = {{{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = { 0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, { i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}}, __padding = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, __padding = {0x7f2d37fbc700, 0x7f2d3b2c9ae0, 0x7f2d37fbc700, 0x1, 0x0, 0x4c8f28122d8dd600, 0x8aa8e17d00c415dd, 0x0 <repeats 17 times>}}, list = {next = 0x7f2d39dd5270 <stack_used>, prev = 0x7f2d39dd5270 <stack_used>}, tid = 7925, pid = 7922, robust_prev = 0x7f2d37fbc9e0, robust_head = {list = 0x7f2d37fbc9e0, futex_offset = -32, list_op_pending = 0x0}, cleanup = 0x0, cleanup_jmp_buf = 0x7f2d37fbbe30, cancelhandling = 2, flags = 1, specific_1stblock = {{seq = 0, data = 0x0}, {seq = 0, data = 0x0}, {seq = 0, data = 0x0}, {seq = 1, data = 0x7f2d30021960}, {seq = 0, data = 0x0} <repeats 28 times>}, specific = {0x7f2d37fbca10, 0x0 <repeats 31 times>}, specific_used = true, report_events = false, user_stack = false, stopped_start = false, parent_cancelhandling = 0, lock = 0, *setxid_futex* = 0, cpuclock_offset = 1398764389412, joinid = 0x7f2d37fbc700, result = 0x0, 
schedparam = {__sched_priority = 0}, schedpolicy = 0, start_routine = 0x7f2d3aaebb60 <virThreadHelper>, arg = 0x7f2d3b2bdce0, eventbuf = {eventmask = {event_bits = {0, 0}}, eventnum = TD_ALL_EVENTS, eventdata = 0x0}, nextevent = 0x0, exc = {exception_class = 0, exception_cleanup = 0x0, private_1 = 0, private_2 = 0}, stackblock = 0x7f2d377bc000, stackblock_size = 8392704, guardsize = 4096, reported_guardsize = 4096, tpp = 0x0, res = {retrans = 0, retry = 0, options = 0, nscount = 0, nsaddr_list = {{sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}}, id = 0, dnsrch = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, defdname = '\000' <repeats 255 times>, pfcode = 0, ndots = 0, nsort = 0, ipv6_unavail = 0, unused = 0, sort_list = {{addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}}, qhook = 0x0, rhook = 0x0, res_h_errno = 0, _vcsock = 0, _flags = 0, _u = {pad = '\000' <repeats 51 times>, _ext = {nscount = 0, nsmap = {0, 0, 0}, nssocks = {0, 0, 0}, nscount6 = 0, nsinit = 0, nsaddrs = {0x0, 0x0, 0x0}, initstamp = 0}}}, end_padding = 0x7f2d37fbcff0 ""}

For the cloned process 7927:

gdb -p 7927
(gdb) bt
#0 setxid_mark_thread (cmdp=0x7f2d3b2ef900, t=0x7f2d37fbc700) at allocatestack.c:994
#1 __nptl_setxid (cmdp=0x7f2d3b2ef900) at allocatestack.c:1086
#2 0x00007f2d392c1da1 in __setregid (rgid=rgid@entry=0, egid=egid@entry=0) at ../sysdeps/unix/sysv/linux/setregid.c:26
#3 0x00007f2d3aaf33f0 in virSetUIDGID (uid=uid@entry=0, gid=gid@entry=0, groups=groups@entry=0x0, ngroups=ngroups@entry=0) at util/virutil.c:1055
#4 0x00007f2d3aa51b3c in lxcContainerSetID (def=0x7f2d3b141190) at lxc/lxc_container.c:427
#5 lxcContainerChild (data=0x7fff40c4d960) at lxc/lxc_container.c:1829
#6 0x00007f2d392ca6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The setxid_futex of the fuse pthread (0x7f2d37fbc700) is still -2.

(gdb) p *t
$2 = {{header = {tcb = 0x7f2d37fbc700, dtv = 0x7f2d3b2c9ae0, self = 0x7f2d37fbc700, multiple_threads = 1, gscope_flag = 0, sysinfo = 0, stack_guard = 5516672127090939392, pointer_guard = 9991483700321457629, vgetcpu_cache = {0, 0}, __unused1 = 0, rtld_must_xmm_save = 0, __private_tm = {0x0, 0x0, 0x0, 0x0}, __private_ss = 0x0, __unused2 = 0, rtld_savespace_sse = {{{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}, {{i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}, {i = {0, 0, 0, 0}}}}, __padding = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}, __padding = {0x7f2d37fbc700, 0x7f2d3b2c9ae0, 0x7f2d37fbc700, 0x1, 0x0, 0x4c8f28122d8dd600, 0x8aa8e17d00c415dd, 0x0 <repeats 17 times>}}, list = {next = 0x7f2d39dd5270 <stack_used>, prev = 0x7f2d39dd5270 <stack_used>}, tid = 7925, pid = 7922, robust_prev = 0x7f2d37fbc9e0, robust_head = {list = 0x7f2d37fbc9e0, futex_offset = -32, list_op_pending = 0x0}, cleanup = 0x0, cleanup_jmp_buf = 0x0, cancelhandling = 0, flags = 1, specific_1stblock = {{seq = 0, data = 0x0} <repeats 32 times>}, specific = {0x7f2d37fbca10, 0x0 <repeats 31 times>}, specific_used
= false, report_events = false, user_stack = false, stopped_start = false, parent_cancelhandling = 0, lock = 0, *setxid_futex* = -2, cpuclock_offset = 0, joinid = 0x7f2d37fbc700, result = 0x0, schedparam = {__sched_priority = 0}, schedpolicy = 0, start_routine = 0x7f2d3aaebb60 <virThreadHelper>, arg = 0x7f2d3b2bdce0, eventbuf = {eventmask = {event_bits = {0, 0}}, eventnum = TD_ALL_EVENTS, eventdata = 0x0}, nextevent = 0x0, exc = {exception_class = 0, exception_cleanup = 0x0, private_1 = 0, private_2 = 0}, stackblock = 0x7f2d377bc000, stackblock_size = 8392704, guardsize = 4096, reported_guardsize = 4096, tpp = 0x0, res = {retrans = 0, retry = 0, options = 0, nscount = 0, nsaddr_list = {{sin_family = 0, sin_port = 0, sin_addr = { s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, {sin_family = 0, sin_port = 0, sin_addr = { s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}}, id = 0, dnsrch = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, defdname = '\000' <repeats 255 times>, pfcode = 0, ndots = 0, nsort = 0, ipv6_unavail = 0, unused = 0, sort_list = {{addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}}, qhook = 0x0, rhook = 0x0, res_h_errno = 0, _vcsock = 0, _flags = 0, _u = {pad = '\000' <repeats 51 times>, _ext = {nscount = 0, nsmap = {0, 0, 0}, nssocks = {0, 0, 0}, nscount6 = 0, nsinit = 0, nsaddrs = {0x0, 0x0, 0x0}, initstamp = 0}}}, end_padding = 0x7f2d37fbcff0 ""}

On 11/08/2013 01:30 PM, Daniel P. Berrange wrote:
> Can you show a stack trace of the threads/processes deadlocking?

Daniel, could you apply this patch? This may not be fixed in glibc quickly, and we should consider that libvirt needs to work with a buggy glibc.

participants (4)
- Daniel P. Berrange
- Eric Blake
- Gao feng
- Rich Felker