[libvirt] [PATCH] virLXCMonitorClose: Unlock domain while closing monitor

There's a race in the lxc driver causing a deadlock. If a domain is destroyed immediately after it is started, the deadlock can occur. When a domain is started, the even loop tries to connect to the monitor. If the connection succeeds, virLXCProcessMonitorInitNotify() is called with @mon->client locked. The first thing that callee does is virObjectLock(vm). So the order of locking is: 1) @mon->client, 2) @vm.

However, if there's another thread executing virDomainDestroy on the very same domain, the first thing done there is locking the @vm. Then the corresponding libvirt_lxc process is killed and the monitor is closed by calling virLXCMonitorClose(). That callee tries to lock @mon->client too, so the order is reversed relative to the first case. This situation results in a deadlock and an unresponsive libvirtd (since the event loop is involved).

The proper solution is to unlock the @vm in virLXCMonitorClose prior to entering virNetClientClose(). See the backtraces below:

Thread 25 (Thread 0x7f1b7c9b8700 (LWP 16312)):
#0  0x00007f1b80539714 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f1b8053516c in _L_lock_516 () from /lib64/libpthread.so.0
#2  0x00007f1b80534fbb in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f1b82a637cf in virMutexLock (m=0x7f1b3c0038d0) at util/virthreadpthread.c:85
#4  0x00007f1b82a4ccf2 in virObjectLock (anyobj=0x7f1b3c0038c0) at util/virobject.c:320
#5  0x00007f1b82b861f6 in virNetClientCloseInternal (client=0x7f1b3c0038c0, reason=3) at rpc/virnetclient.c:696
#6  0x00007f1b82b862f5 in virNetClientClose (client=0x7f1b3c0038c0) at rpc/virnetclient.c:721
#7  0x00007f1b6ee12500 in virLXCMonitorClose (mon=0x7f1b3c007210) at lxc/lxc_monitor.c:216
#8  0x00007f1b6ee129f0 in virLXCProcessCleanup (driver=0x7f1b68100240, vm=0x7f1b680ceb70, reason=VIR_DOMAIN_SHUTOFF_DESTROYED) at lxc/lxc_process.c:174
#9  0x00007f1b6ee14106 in virLXCProcessStop (driver=0x7f1b68100240, vm=0x7f1b680ceb70, reason=VIR_DOMAIN_SHUTOFF_DESTROYED) at lxc/lxc_process.c:710
#10 0x00007f1b6ee1aa36 in lxcDomainDestroyFlags (dom=0x7f1b5c002560, flags=0) at lxc/lxc_driver.c:1291
#11 0x00007f1b6ee1ab1a in lxcDomainDestroy (dom=0x7f1b5c002560) at lxc/lxc_driver.c:1321
#12 0x00007f1b82b05be5 in virDomainDestroy (domain=0x7f1b5c002560) at libvirt.c:2303
#13 0x00007f1b835a7e85 in remoteDispatchDomainDestroy (server=0x7f1b857419d0, client=0x7f1b8574ae40, msg=0x7f1b8574acf0, rerr=0x7f1b7c9b7c30, args=0x7f1b5c004a50) at remote_dispatch.h:3143
#14 0x00007f1b835a7d78 in remoteDispatchDomainDestroyHelper (server=0x7f1b857419d0, client=0x7f1b8574ae40, msg=0x7f1b8574acf0, rerr=0x7f1b7c9b7c30, args=0x7f1b5c004a50, ret=0x7f1b5c0029e0) at remote_dispatch.h:3121
#15 0x00007f1b82b93704 in virNetServerProgramDispatchCall (prog=0x7f1b8573af90, server=0x7f1b857419d0, client=0x7f1b8574ae40, msg=0x7f1b8574acf0) at rpc/virnetserverprogram.c:435
#16 0x00007f1b82b93263 in virNetServerProgramDispatch (prog=0x7f1b8573af90, server=0x7f1b857419d0, client=0x7f1b8574ae40, msg=0x7f1b8574acf0) at rpc/virnetserverprogram.c:305
#17 0x00007f1b82b8c0f6 in virNetServerProcessMsg (srv=0x7f1b857419d0, client=0x7f1b8574ae40, prog=0x7f1b8573af90, msg=0x7f1b8574acf0) at rpc/virnetserver.c:163
#18 0x00007f1b82b8c1da in virNetServerHandleJob (jobOpaque=0x7f1b8574dca0, opaque=0x7f1b857419d0) at rpc/virnetserver.c:184
#19 0x00007f1b82a64158 in virThreadPoolWorker (opaque=0x7f1b8573cb10) at util/virthreadpool.c:144
#20 0x00007f1b82a63ae5 in virThreadHelper (data=0x7f1b8574b9f0) at util/virthreadpthread.c:161
#21 0x00007f1b80532f4a in start_thread () from /lib64/libpthread.so.0
#22 0x00007f1b7fc4f20d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f1b83546740 (LWP 16297)):
#0  0x00007f1b80539714 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f1b8053516c in _L_lock_516 () from /lib64/libpthread.so.0
#2  0x00007f1b80534fbb in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f1b82a637cf in virMutexLock (m=0x7f1b680ceb80) at util/virthreadpthread.c:85
#4  0x00007f1b82a4ccf2 in virObjectLock (anyobj=0x7f1b680ceb70) at util/virobject.c:320
#5  0x00007f1b6ee13bd7 in virLXCProcessMonitorInitNotify (mon=0x7f1b3c007210, initpid=4832, vm=0x7f1b680ceb70) at lxc/lxc_process.c:601
#6  0x00007f1b6ee11fd3 in virLXCMonitorHandleEventInit (prog=0x7f1b3c001f10, client=0x7f1b3c0038c0, evdata=0x7f1b8574a7d0, opaque=0x7f1b3c007210) at lxc/lxc_monitor.c:109
#7  0x00007f1b82b8a196 in virNetClientProgramDispatch (prog=0x7f1b3c001f10, client=0x7f1b3c0038c0, msg=0x7f1b3c003928) at rpc/virnetclientprogram.c:259
#8  0x00007f1b82b87030 in virNetClientCallDispatchMessage (client=0x7f1b3c0038c0) at rpc/virnetclient.c:1019
#9  0x00007f1b82b876bb in virNetClientCallDispatch (client=0x7f1b3c0038c0) at rpc/virnetclient.c:1140
#10 0x00007f1b82b87d41 in virNetClientIOHandleInput (client=0x7f1b3c0038c0) at rpc/virnetclient.c:1312
#11 0x00007f1b82b88f51 in virNetClientIncomingEvent (sock=0x7f1b3c0044e0, events=1, opaque=0x7f1b3c0038c0) at rpc/virnetclient.c:1832
#12 0x00007f1b82b9e1c8 in virNetSocketEventHandle (watch=3321, fd=54, events=1, opaque=0x7f1b3c0044e0) at rpc/virnetsocket.c:1695
#13 0x00007f1b82a272cf in virEventPollDispatchHandles (nfds=21, fds=0x7f1b8574ded0) at util/vireventpoll.c:498
#14 0x00007f1b82a27af2 in virEventPollRunOnce () at util/vireventpoll.c:645
#15 0x00007f1b82a25a61 in virEventRunDefaultImpl () at util/virevent.c:273
#16 0x00007f1b82b8e97e in virNetServerRun (srv=0x7f1b857419d0) at rpc/virnetserver.c:1097
#17 0x00007f1b8359db6b in main (argc=2, argv=0x7ffff98dbaa8) at libvirtd.c:1512

---
 src/lxc/lxc_monitor.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/src/lxc/lxc_monitor.c b/src/lxc/lxc_monitor.c
index 999039b..07e9301 100644
--- a/src/lxc/lxc_monitor.c
+++ b/src/lxc/lxc_monitor.c
@@ -205,6 +205,9 @@ static void virLXCMonitorDispose(void *opaque)
 
 void virLXCMonitorClose(virLXCMonitorPtr mon)
 {
+    virDomainObjPtr vm;
+    virNetClientPtr client;
+
     VIR_DEBUG("mon=%p", mon);
     if (mon->client) {
         /* When manually closing the monitor, we don't
@@ -212,9 +215,18 @@ void virLXCMonitorClose(virLXCMonitorPtr mon)
          * the caller is not re-entrant safe */
         VIR_DEBUG("Clear EOF callback mon=%p", mon);
-        mon->cb.eofNotify = NULL;
-        virNetClientClose(mon->client);
-        virObjectUnref(mon->client);
+        vm = mon->vm;
+        client = mon->client;
         mon->client = NULL;
+        mon->cb.eofNotify = NULL;
+
+        virObjectRef(vm);
+        virObjectUnlock(vm);
+
+        virNetClientClose(client);
+        virObjectUnref(client);
+
+        virObjectLock(vm);
+        virObjectUnref(vm);
     }
 }
--
1.8.1.5
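The deadlock described above is the classic AB/BA lock-order inversion: the event-loop thread holds @mon->client's lock and wants @vm's lock, while the destroy thread holds @vm's lock and wants @mon->client's lock. Below is a minimal, self-contained sketch of that inversion using plain pthread mutexes; the thread and mutex names are hypothetical stand-ins for the libvirt objects involved, not libvirt code (build with -pthread; the program intentionally hangs):

/* Minimal illustration of the inverted lock ordering described above.
 * "client_lock" stands in for @mon->client's lock and "vm_lock" for @vm's
 * lock; names and structure are hypothetical, not libvirt code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t client_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;

/* Event-loop thread: locks the client, then wants the vm
 * (cf. virLXCProcessMonitorInitNotify in the Thread 1 backtrace). */
static void *event_loop_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&client_lock);
    sleep(1);                         /* widen the race window */
    pthread_mutex_lock(&vm_lock);     /* blocks: destroy thread holds it */
    pthread_mutex_unlock(&vm_lock);
    pthread_mutex_unlock(&client_lock);
    return NULL;
}

/* Destroy thread: locks the vm, then wants the client
 * (cf. virDomainDestroy -> virLXCMonitorClose -> virNetClientClose). */
static void *destroy_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&vm_lock);
    sleep(1);
    pthread_mutex_lock(&client_lock); /* blocks: event loop holds it */
    pthread_mutex_unlock(&client_lock);
    pthread_mutex_unlock(&vm_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, event_loop_thread, NULL);
    pthread_create(&t2, NULL, destroy_thread, NULL);
    pthread_join(t1, NULL);           /* never returns: AB/BA deadlock */
    pthread_join(t2, NULL);
    puts("done");                     /* unreachable */
    return 0;
}

The patch above breaks the cycle by having the destroy path drop @vm's lock (while holding an extra reference) before it takes the client lock inside virNetClientClose().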

On Wed, Jul 24, 2013 at 12:15:32PM +0200, Michal Privoznik wrote:
There's a race in the lxc driver causing a deadlock. If a domain is destroyed immediately after it is started, the deadlock can occur. When a domain is started, the even loop tries to connect to the monitor. If the connection succeeds, virLXCProcessMonitorInitNotify() is called with @mon->client locked. The first thing that callee does is virObjectLock(vm). So the order of locking is: 1) @mon->client, 2) @vm.
However, if there's another thread executing virDomainDestroy on the very same domain, the first thing done there is locking the @vm. Then the corresponding libvirt_lxc process is killed and the monitor is closed by calling virLXCMonitorClose(). That callee tries to lock @mon->client too, so the order is reversed relative to the first case. This situation results in a deadlock and an unresponsive libvirtd (since the event loop is involved).
The proper solution is to unlock the @vm in virLXCMonitorClose prior to entering virNetClientClose(). See the backtraces below:
Hmm, I think I'd say that the flaw is in the way virLXCProcessMonitorInitNotify is invoked. In the QEMU driver monitor, we unlock the monitor before invoking any callbacks. In the LXC driver monitor we're invoking the callbacks with the monitor lock held. I think we need to make the LXC monitor locking wrt callbacks do what QEMU does, and unlock the monitor. See QEMU_MONITOR_CALLBACK in qemu_monitor.c.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
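For context, the pattern referred to here takes an extra reference on the monitor, drops its lock, invokes the callback, and only then re-locks and unrefs, so no callback ever runs with the monitor lock held. A rough sketch of that shape follows, with hypothetical types and helper names rather than libvirt's actual QEMU_MONITOR_CALLBACK macro (Michal's reply below notes the LXC monitor is in fact already unlocked at this point, so this is background on the suggestion rather than the eventual fix):

/* Hedged sketch of the "drop the lock around callbacks" pattern.
 * All types and helper names are hypothetical stand-ins, not libvirt code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Monitor Monitor;
typedef void (*MonitorEventCb)(Monitor *mon, void *opaque);

struct Monitor {
    pthread_mutex_t lock;
    atomic_int refs;
    MonitorEventCb initNotify;   /* stand-in for mon->cb.initNotify */
    void *opaque;
};

static void monitorRef(Monitor *mon)
{
    atomic_fetch_add(&mon->refs, 1);
}

static void monitorUnref(Monitor *mon)
{
    if (atomic_fetch_sub(&mon->refs, 1) == 1) {
        pthread_mutex_destroy(&mon->lock);
        free(mon);
    }
}

/* Called with mon->lock held (as RPC dispatch code would be): take an extra
 * reference, drop the lock, run the callback, then re-lock and unref.  The
 * callback can therefore lock other objects (e.g. the domain) without
 * inverting the lock order. */
static void monitorEmitInit(Monitor *mon)
{
    MonitorEventCb cb = mon->initNotify;
    void *opaque = mon->opaque;

    if (!cb)
        return;

    monitorRef(mon);                   /* keep mon alive while unlocked */
    pthread_mutex_unlock(&mon->lock);
    cb(mon, opaque);
    pthread_mutex_lock(&mon->lock);
    monitorUnref(mon);
}

static void onInit(Monitor *mon, void *opaque)
{
    (void)mon;
    printf("init event for %s delivered with the monitor unlocked\n",
           (const char *)opaque);
}

int main(void)
{
    Monitor *mon = calloc(1, sizeof(*mon));

    pthread_mutex_init(&mon->lock, NULL);
    atomic_init(&mon->refs, 1);
    mon->initNotify = onInit;
    mon->opaque = "lxc-domain";

    pthread_mutex_lock(&mon->lock);
    monitorEmitInit(mon);
    pthread_mutex_unlock(&mon->lock);

    monitorUnref(mon);
    return 0;
}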

On 24.07.2013 12:29, Daniel P. Berrange wrote:
On Wed, Jul 24, 2013 at 12:15:32PM +0200, Michal Privoznik wrote:
There's a race in the lxc driver causing a deadlock. If a domain is destroyed immediately after it is started, the deadlock can occur. When a domain is started, the even loop tries to connect to the monitor. If the connection succeeds, virLXCProcessMonitorInitNotify() is called with @mon->client locked. The first thing that callee does is virObjectLock(vm). So the order of locking is: 1) @mon->client, 2) @vm.
However, if there's another thread executing virDomainDestroy on the very same domain, the first thing done there is locking the @vm. Then the corresponding libvirt_lxc process is killed and the monitor is closed by calling virLXCMonitorClose(). That callee tries to lock @mon->client too, so the order is reversed relative to the first case. This situation results in a deadlock and an unresponsive libvirtd (since the event loop is involved).
The proper solution is to unlock the @vm in virLXCMonitorClose prior to entering virNetClientClose(). See the backtraces below:
Hmm, I think I'd say that the flaw is in the way virLXCProcessMonitorInitNotify is invoked. In the QEMU driver monitor, we unlock the monitor before invoking any callbacks. In the LXC driver monitor we're invoking the callbacks with the monitor lock held. I think we need to make the LXC monitor locking wrt callbacks do what QEMU does, and unlock the monitor. See QEMU_MONITOR_CALLBACK in qemu_monitor.c
Daniel
I don't think so. It's not the monitor lock that is causing the deadlock here. In fact, the monitor is unlocked:

Thread 1 (Thread 0x7f35a348e740 (LWP 18839)):
#0  0x00007f35a0481714 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f35a047d16c in _L_lock_516 () from /lib64/libpthread.so.0
#2  0x00007f35a047cfbb in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f35a29ab83f in virMutexLock (m=0x7f3588024e80) at util/virthreadpthread.c:85
#4  0x00007f35a2994d62 in virObjectLock (anyobj=0x7f3588024e70) at util/virobject.c:320
#5  0x00007f358ed5bbd7 in virLXCProcessMonitorInitNotify (mon=0x7f3560000ab0, initpid=29062, vm=0x7f3588024e70) at lxc/lxc_process.c:601
#6  0x00007f358ed59fd3 in virLXCMonitorHandleEventInit (prog=0x7f35600087b0, client=0x7f3560001fd0, evdata=0x7f35a53bc1e0, opaque=0x7f3560000ab0) at lxc/lxc_monitor.c:109
#7  0x00007f35a2ad2206 in virNetClientProgramDispatch (prog=0x7f35600087b0, client=0x7f3560001fd0, msg=0x7f3560002038) at rpc/virnetclientprogram.c:259
#8  0x00007f35a2acf0a0 in virNetClientCallDispatchMessage (client=0x7f3560001fd0) at rpc/virnetclient.c:1019
#9  0x00007f35a2acf72b in virNetClientCallDispatch (client=0x7f3560001fd0) at rpc/virnetclient.c:1140
#10 0x00007f35a2acfdb1 in virNetClientIOHandleInput (client=0x7f3560001fd0) at rpc/virnetclient.c:1312
#11 0x00007f35a2ad0fc1 in virNetClientIncomingEvent (sock=0x7f3560008350, events=1, opaque=0x7f3560001fd0) at rpc/virnetclient.c:1832
#12 0x00007f35a2ae6238 in virNetSocketEventHandle (watch=47, fd=40, events=1, opaque=0x7f3560008350) at rpc/virnetsocket.c:1695
#13 0x00007f35a296f33f in virEventPollDispatchHandles (nfds=22, fds=0x7f35a53bc7a0) at util/vireventpoll.c:498
#14 0x00007f35a296fb62 in virEventPollRunOnce () at util/vireventpoll.c:645
#15 0x00007f35a296dad1 in virEventRunDefaultImpl () at util/virevent.c:273
#16 0x00007f35a2ad69ee in virNetServerRun (srv=0x7f35a53b09d0) at rpc/virnetserver.c:1097
#17 0x00007f35a34e5b6b in main (argc=2, argv=0x7fffe188e778) at libvirtd.c:1512

(gdb) up
#1  0x00007f35a047d16c in _L_lock_516 () from /lib64/libpthread.so.0
(gdb) up
#2  0x00007f35a047cfbb in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb)
#3  0x00007f35a29ab83f in virMutexLock (m=0x7f3588024e80) at util/virthreadpthread.c:85
85          pthread_mutex_lock(&m->lock);
(gdb)
#4  0x00007f35a2994d62 in virObjectLock (anyobj=0x7f3588024e70) at util/virobject.c:320
320         virMutexLock(&obj->lock);
(gdb)
#5  0x00007f358ed5bbd7 in virLXCProcessMonitorInitNotify (mon=0x7f3560000ab0, initpid=29062, vm=0x7f3588024e70) at lxc/lxc_process.c:601
601         virObjectLock(vm);
(gdb) p *mon
$1 = {parent = {parent = {magic = 3405643812, refs = 2, klass = 0x7f3588102cb0}, lock = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}}, vm = 0x7f3588024e70, cb = {destroy = 0x0, eofNotify = 0x0, exitNotify = 0x7f358ed5b928 <virLXCProcessMonitorExitNotify>, initNotify = 0x7f358ed5bb8b <virLXCProcessMonitorInitNotify>}, client = 0x7f3560001fd0, program = 0x7f35600087b0}
(gdb)

Michal

On Wed, Jul 24, 2013 at 01:43:02PM +0200, Michal Privoznik wrote:
On 24.07.2013 12:29, Daniel P. Berrange wrote:
On Wed, Jul 24, 2013 at 12:15:32PM +0200, Michal Privoznik wrote:
There's a race in the lxc driver causing a deadlock. If a domain is destroyed immediately after it is started, the deadlock can occur. When a domain is started, the even loop tries to connect to the monitor. If the connection succeeds, virLXCProcessMonitorInitNotify() is called with @mon->client locked. The first thing that callee does is virObjectLock(vm). So the order of locking is: 1) @mon->client, 2) @vm.
However, if there's another thread executing virDomainDestroy on the very same domain, the first thing done there is locking the @vm. Then the corresponding libvirt_lxc process is killed and the monitor is closed by calling virLXCMonitorClose(). That callee tries to lock @mon->client too, so the order is reversed relative to the first case. This situation results in a deadlock and an unresponsive libvirtd (since the event loop is involved).
The proper solution is to unlock the @vm in virLXCMonitorClose prior to entering virNetClientClose(). See the backtraces below:
Hmm, I think I'd say that the flaw is in the way virLXCProcessMonitorInitNotify is invoked. In the QEMU driver monitor, we unlock the monitor before invoking any callbacks. In the LXC driver monitor we're invoking the callbacks with the monitor lock held. I think we need to make the LXC monitor locking wrt callbacks do what QEMU does, and unlock the monitor. See QEMU_MONITOR_CALLBACK in qemu_monitor.c
Daniel
I don't think so. It's not the monitor lock that is causing the deadlock here. In fact, the monitor is unlocked:
Oh, I was mixing up the monitor vs virnetclient locks. ACK to your original patch.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 07/24/2013 04:15 AM, Michal Privoznik wrote:
There's a race in the lxc driver causing a deadlock. If a domain is destroyed immediately after it is started, the deadlock can occur. When a domain is started, the even loop tries to connect to the monitor. If the
s/even/event/
connection succeeds, virLXCProcessMonitorInitNotify() is called with @mon->client locked. The first thing that callee does is virObjectLock(vm). So the order of locking is: 1) @mon->client, 2) @vm.
--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
participants (3)
- Daniel P. Berrange
- Eric Blake
- Michal Privoznik