[libvirt] Fwd: xen/libvirt freeze while attaching network-device to VMs

[edited to reduce size to get past moderation]

-------- Forwarded Message --------
Date: Wed, 4 Nov 2015 15:57:26 +0100
From: max ustermann <ustermann78@web.de>
To: libvirt-users@redhat.com
Subject: xen/libvirt freeze while attaching network-device to VMs
Message-ID: <20151104155726.72fd6109@linux-3go9>

Hi,

first, sorry for the cross-posting, but I don't know exactly whether my problem belongs to Xen or libvirt.

I have a VM host with Xen 4.6.0 and libvirt 1.2.19. The job of the host is to resume 20 VMs (HVM WinXP) in parallel, attach a block device to each VM, wait 60 seconds, kill the VMs, and start from the beginning. For that, a Python script creates a thread for each VM; each thread gets a connection to the libvirt daemon via "libvirt.open(None)" and does the resume, attach, and so on in a loop.

This setup works fine; it does several thousand cycles without any problems. If I extend this setup so that each VM gets, in addition to the block device, a network interface after resume (via an "attachDevice" call from the Python script), I observe that libvirtd/xen freezes after 20 to 100 cycles. There are no error messages, no crash, nothing in "dmesg" or "xl dmesg".
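The reproduction script itself is not included in the thread; the following is only a minimal sketch of the cycle the mail describes, using the libvirt-python calls it names (libvirt.open(None), attachDevice). The VM lookup by name, the bridge/MAC values, and the cycle count are illustrative assumptions, not taken from the actual script:

```python
import time

def interface_xml(bridge, mac, model="rtl8139"):
    """Build the <interface> XML seen later in the backtrace (frame #15)."""
    return (
        '<interface type="bridge">\n'
        '  <source bridge="%s" />\n'
        '  <model type="%s" />\n'
        '  <mac address="%s" />\n'
        '</interface>' % (bridge, model, mac)
    )

def vm_cycle(name, disk_xml, net_xml, cycles):
    """One worker thread's loop; the real script runs 20 of these in parallel."""
    import libvirt                  # requires the libvirt-python bindings
    conn = libvirt.open(None)       # one daemon connection per thread
    for _ in range(cycles):
        dom = conn.lookupByName(name)   # assumed: how the script finds its VM
        dom.resume()
        dom.attachDevice(disk_xml)  # block device alone: thousands of cycles OK
        dom.attachDevice(net_xml)   # adding this freezes libvirtd after 20-100 cycles
        time.sleep(60)
        dom.destroy()               # "kill the VM and start from the beginning"
    conn.close()
```

The XML builder reproduces the exact device XML visible in the gdb backtrace below.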
When I attached to libvirtd via gdb and did "info threads", I got the following:

  Id  Target Id                          Frame
  21  Thread 0x7f792a92c700 (LWP 1312)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  20  Thread 0x7f792a12b700 (LWP 1313)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  19  Thread 0x7f792992a700 (LWP 1314)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  18  Thread 0x7f7929129700 (LWP 1315)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  17  Thread 0x7f7928928700 (LWP 1316)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  16  Thread 0x7f791bfff700 (LWP 1317)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  15  Thread 0x7f791b7fe700 (LWP 1318)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  14  Thread 0x7f791affd700 (LWP 1319)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  13  Thread 0x7f791a7fc700 (LWP 1320)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  12  Thread 0x7f7919ffb700 (LWP 1321)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  11  Thread 0x7f78fd6ff700 (LWP 1334)  "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  10  Thread 0x7f78fcefe700 (LWP 1335)  "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  9   Thread 0x7f78f3fff700 (LWP 1336)  "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  8   Thread 0x7f78ebfff700 (LWP 1337)  "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  7   Thread 0x7f78f37fe700 (LWP 1338)  "libvirtd" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  6   Thread 0x7f793bf94700 (LWP 1494)  "libvirtd" 0x00007f79375c87fd in read () at ../sysdeps/unix/syscall-template.S:81
  5   Thread 0x7f78f241d700 (LWP 17709) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  4   Thread 0x7f78f1c1c700 (LWP 17717) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  3   Thread 0x7f78f141b700 (LWP 17722) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  2   Thread 0x7f78f0c1a700 (LWP 17751) "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
* 1   Thread 0x7f793c0428c0 (LWP 1269)  "libvirtd" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135

I started to examine thread 1 and got the following bt:

  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f78d80edd30) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f793b09cd0c in virMutexLock (m=0x7f78d80edd30) at util/virthread.c:89
  #4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f78d80edd20) at util/virobject.c:323
  #5  0x00007f793b0b71d7 in virDomainObjListFindByID (doms=0x7f78d80edd20, id=1) at conf/domain_conf.c:1237
  #6  0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f793c30c2e0) at libxl/libxl_domain.c:534
  #7  0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7ffdb79c7490) at libxl_event.c:1405
  #8  0x00007f793806fb9d in libxl__egc_cleanup (egc=0x7ffdb79c7490) at libxl_event.c:1436
  #9  libxl_osevent_occurred_fd (ctx=<optimized out>, for_libxl=0x7f78d80ec220, fd=27, events_ign=<optimized out>, revents_ign=<optimized out>) at libxl_event.c:1339
  #10 0x00007f78fe3f9ac5 in libxlFDEventCallback (watch=11, fd=27, vir_events=1, fd_info=0x7f78d811c430) at libxl/libxl_driver.c:134
  #11 0x00007f793b04a9c4 in virEventPollDispatchHandles (nfds=31, fds=0x7f793c30b510) at util/vireventpoll.c:509
  #12 0x00007f793b04b224 in virEventPollRunOnce () at util/vireventpoll.c:658
  #13 0x00007f793b0490cd in virEventRunDefaultImpl () at util/virevent.c:308
  #14 0x00007f793b1f5ab6 in virNetDaemonRun (dmn=0x7f793c2e1b80) at rpc/virnetdaemon.c:701
  #15 0x00007f793be0bc27 in main (argc=1, argv=0x7ffdb79c7d78) at libvirtd.c:1577

I selected frame 3 and looked at who holds the lock:

  (gdb) frame 3
  #3  0x00007f793b09cd0c in virMutexLock (m=0x7f78d80edd30) at util/virthread.c:89
  89          pthread_mutex_lock(&m->lock);
  (gdb) p *m
  $5 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312, __nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>, __align = 2}}

I switched to thread 21 (LWP 1312) and took a look at the bt:

  (gdb) bt
  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f79140099d0) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
  #4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323
  #5  0x00007f793b0b717f in virDomainObjListSearchID (payload=0x7f79140099c0, name=0x7f791400f7c0, data=0x7f792a92b464) at conf/domain_conf.c:1225
  #6  0x00007f793b055f1e in virHashSearch (ctable=0x7f78d81ccc40, iter=0x7f793b0b7148 <virDomainObjListSearchID>, data=0x7f792a92b464) at util/virhash.c:724
  #7  0x00007f793b0b71f2 in virDomainObjListFindByID (doms=0x7f78d80edd20, id=2) at conf/domain_conf.c:1238
  #8  0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f791c005610) at libxl/libxl_domain.c:534
  #9  0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7f792a92b580) at libxl_event.c:1405
  #10 0x00007f7938070f8b in libxl__egc_cleanup (egc=0x7f792a92b580) at libxl_event.c:1436
  #11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014
  #12 0x00007f7938048fa6 in libxl_device_nic_add (ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>) at libxl.c:4226
  #13 0x00007f78fe402baa in libxlDomainAttachNetDevice (driver=0x7f78d81ce940, vm=0x7f79140099c0, net=0x7f791c003610) at libxl/libxl_driver.c:3199
  #14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231
  #15 0x00007f78fe404135 in libxlDomainAttachDeviceFlags (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>", flags=1) at libxl/libxl_driver.c:3707
  #16 0x00007f78fe404239 in libxlDomainAttachDevice (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>") at libxl/libxl_driver.c:3745
  #17 0x00007f793b18422c in virDomainAttachDevice (domain=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>") at libvirt-domain.c:8351
  #18 0x00007f793be11a1f in remoteDispatchDomainAttachDevice (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190) at remote_dispatch.h:2850
  #19 0x00007f793be11923 in remoteDispatchDomainAttachDeviceHelper (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190, ret=0x7f791c009130) at remote_dispatch.h:2826
  #20 0x00007f793b1fccfc in virNetServerProgramDispatchCall (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:437
  #21 0x00007f793b1fc863 in virNetServerProgramDispatch (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:307
  #22 0x00007f793b1f5e6d in virNetServerProcessMsg (srv=0x7f793c2e0a90, client=0x7f793c2f34c0, prog=0x7f793c2eefd0, msg=0x7f793c2feca0) at rpc/virnetserver.c:135
  #23 0x00007f793b1f5f3a in virNetServerHandleJob (jobOpaque=0x7f793c302c60, opaque=0x7f793c2e0a90) at rpc/virnetserver.c:156
  #24 0x00007f793b09d9f7 in virThreadPoolWorker (opaque=0x7f793c2e0750) at util/virthreadpool.c:145
  #25 0x00007f793b09d00a in virThreadHelper (data=0x7f793c2eced0) at util/virthread.c:206
  #26 0x00007f79375c1ee5 in start_thread (arg=0x7f792a92c700) at pthread_create.c:309
  #27 0x00007f79372f0d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Here I selected frame 3 and looked at who holds the lock:

  (gdb) frame 3
  #3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
  89          pthread_mutex_lock(&m->lock);
  $6 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312, __nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>, __align = 2}}

So I see that thread 21 waits for a lock that is held by thread 21 itself.

From frame 4 I see that the thread wants a lock for a VM object:

  #4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323

Further down in the bt I see in frame 14:

  #14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231

that VM object 0x7f79140099c0 is the object to which the network device should also be attached.
And from that I know that there is a lock on the VM object.

So I go up the bt and see that libxlDomainAttachDeviceLive (libvirtd, frame 14) calls libxlDomainAttachNetDevice (libvirtd, frame 13), which calls libxl_device_nic_add (libxl/Xen, frame 12).

From "info frame" for frame 12 I see that it calls frame 0x7f792a92b5f0:

  (gdb) info f
  Stack level 12, frame at 0x7f792a92b670:
   rip = 0x7f7938048fa6 in libxl_device_nic_add (libxl.c:4226); saved rip = 0x7f78fe402baa
   called by frame at 0x7f792a92b760, caller of frame at 0x7f792a92b5f0
   source language c.
   Arglist at 0x7f792a92b5e8, args: ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>
   Locals at 0x7f792a92b5e8, Previous frame's sp is 0x7f792a92b670
   Saved registers:
    rbx at 0x7f792a92b640, rbp at 0x7f792a92b648, r12 at 0x7f792a92b650, r13 at 0x7f792a92b658, r14 at 0x7f792a92b660, rip at 0x7f792a92b668

and this is frame 11:

  #11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014

So further I see that libxl__ao_inprogress calls libxl__egc_cleanup, and that calls egc_run_callbacks (frame 9). "info frame" for frame 9 gives:

  (gdb) info f
  Stack level 9, frame at 0x7f792a92b540:
   rip = 0x7f793806f7c0 in egc_run_callbacks (libxl_event.c:1405); saved rip = 0x7f7938070f8b
   called by frame at 0x7f792a92b5f0, caller of frame at 0x7f792a92b500
   source language c.
   Arglist at 0x7f792a92b4f8, args: egc=egc@entry=0x7f792a92b580
   Locals at 0x7f792a92b4f8, Previous frame's sp is 0x7f792a92b540
   Saved registers:
    rbx at 0x7f792a92b520, rbp at 0x7f792a92b528, r12 at 0x7f792a92b530, rip at 0x7f792a92b538

and "list" gives:

  (gdb) list
  1400
  1401      LIBXL_TAILQ_FOREACH_SAFE(ev, &egc->occurred_for_callback, link, ev_tmp) {
  1402          LIBXL_TAILQ_REMOVE(&egc->occurred_for_callback, ev, link);
  1403          LOG(DEBUG,"event %p callback type=%s",
  1404              ev, libxl_event_type_to_string(ev->type));
  1405          CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);
  1406      }
  1407
  1408      LIBXL_TAILQ_FOREACH_SAFE(aop, &egc->aops_for_callback, entry, aop_tmp) {
  1409          LIBXL_TAILQ_REMOVE(&egc->aops_for_callback, aop, entry);

From here I see that egc_run_callbacks is at line 1405, "CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);", but I have no idea what it is doing there and why it "hangs" here.

If I see it correctly, the system freezes/hangs because it waits for the release of the lock on the VM that is currently attaching a network device, while the process of attaching the network device hasn't finished and still holds that lock. As I said at the beginning of my mail, if I only attach a block device to the VMs, everything runs; with the additional attaching of a network device, it freezes after 20 to 100 resumes.

I am grateful for any help, hints, and tips for further debugging.

all the best
max

For completeness, here is the output of "(gdb) thread apply all bt full":

[trimmed by moderator]
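Max's hypothesis above (a thread blocking on a mutex whose __owner field names that very thread) can be demonstrated in miniature with any non-reentrant lock; the following sketch uses Python's threading.Lock purely as a stand-in for the pthread mutex inside virMutex (all names here are illustrative, not libvirt code):

```python
import threading

lock = threading.Lock()   # non-reentrant, like the pthread mutex inside virMutex

def event_callback():
    # Stand-in for libxlDomainEventHandler -> virDomainObjListFindByID,
    # which tries to lock the same domain object again.  With a
    # non-reentrant lock this acquire can never succeed; the timeout
    # exists only so the demo terminates instead of hanging forever.
    acquired = lock.acquire(timeout=0.2)
    if acquired:
        lock.release()
    return acquired

def attach_device():
    # Stand-in for libxlDomainAttachDeviceFlags: takes the domain lock,
    # then calls into libxl, which runs pending event callbacks inline.
    with lock:
        return event_callback()

print(attach_device())   # False: the callback waits on a lock its own thread holds
```

In libvirtd there is no timeout, so the equivalent acquire blocks forever, which is exactly the __lll_lock_wait seen in thread 21's backtrace.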

On 11/04/2015 08:29 AM, Eric Blake wrote:
[edited to reduce size to get past moderation]
-------- Forwarded Message --------
Date: Wed, 4 Nov 2015 15:57:26 +0100
From: max ustermann <ustermann78@web.de>
To: libvirt-users@redhat.com
Subject: xen/libvirt freeze while attaching network-device to VMs
Message-ID: <20151104155726.72fd6109@linux-3go9>
Hi,
first, sorry for the cross-posting, but I don't know exactly whether my problem belongs to Xen or libvirt.
AFAICT, this was only posted to libvirt-users, but I'll respond to Eric's forwarded message since the issue is better suited for the libvirt dev list. First, thanks for the report and detailed info.
I have a VM host with Xen 4.6.0 and libvirt 1.2.19. The job of the host is to resume 20 VMs (HVM WinXP) in parallel, attach a block device to each VM, wait 60 seconds, kill the VMs, and start from the beginning. For that, a Python script creates a thread for each VM; each thread gets a connection to the libvirt daemon via "libvirt.open(None)" and does the resume, attach, and so on in a loop.

This setup works fine; it does several thousand cycles without any problems. If I extend this setup so that each VM gets, in addition to the block device, a network interface after resume (via an "attachDevice" call from the Python script), I observe that libvirtd/xen freezes after 20 to 100 cycles. There are no error messages, no crash, nothing in "dmesg" or "xl dmesg".
When I attached to libvirtd via gdb [...]

  (gdb) bt
  #0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  #1  0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0
  #2  0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f79140099d0) at ../nptl/pthread_mutex_lock.c:79
  #3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
  #4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323
  #5  0x00007f793b0b717f in virDomainObjListSearchID (payload=0x7f79140099c0, name=0x7f791400f7c0, data=0x7f792a92b464) at conf/domain_conf.c:1225
  #6  0x00007f793b055f1e in virHashSearch (ctable=0x7f78d81ccc40, iter=0x7f793b0b7148 <virDomainObjListSearchID>, data=0x7f792a92b464) at util/virhash.c:724
  #7  0x00007f793b0b71f2 in virDomainObjListFindByID (doms=0x7f78d80edd20, id=2) at conf/domain_conf.c:1238
  #8  0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f791c005610) at libxl/libxl_domain.c:534
  #9  0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7f792a92b580) at libxl_event.c:1405
  #10 0x00007f7938070f8b in libxl__egc_cleanup (egc=0x7f792a92b580) at libxl_event.c:1436
  #11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014
  #12 0x00007f7938048fa6 in libxl_device_nic_add (ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>) at libxl.c:4226
  #13 0x00007f78fe402baa in libxlDomainAttachNetDevice (driver=0x7f78d81ce940, vm=0x7f79140099c0, net=0x7f791c003610) at libxl/libxl_driver.c:3199
  #14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231
  #15 0x00007f78fe404135 in libxlDomainAttachDeviceFlags (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>", flags=1) at libxl/libxl_driver.c:3707
  #16 0x00007f78fe404239 in libxlDomainAttachDevice (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>") at libxl/libxl_driver.c:3745
  #17 0x00007f793b18422c in virDomainAttachDevice (domain=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n <source bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n <mac address=\"00:27:0e:0c:01:09\" />\n </interface>") at libvirt-domain.c:8351
  #18 0x00007f793be11a1f in remoteDispatchDomainAttachDevice (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190) at remote_dispatch.h:2850
  #19 0x00007f793be11923 in remoteDispatchDomainAttachDeviceHelper (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190, ret=0x7f791c009130) at remote_dispatch.h:2826
  #20 0x00007f793b1fccfc in virNetServerProgramDispatchCall (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:437
  #21 0x00007f793b1fc863 in virNetServerProgramDispatch (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:307
  #22 0x00007f793b1f5e6d in virNetServerProcessMsg (srv=0x7f793c2e0a90, client=0x7f793c2f34c0, prog=0x7f793c2eefd0, msg=0x7f793c2feca0) at rpc/virnetserver.c:135
  #23 0x00007f793b1f5f3a in virNetServerHandleJob (jobOpaque=0x7f793c302c60, opaque=0x7f793c2e0a90) at rpc/virnetserver.c:156
  #24 0x00007f793b09d9f7 in virThreadPoolWorker (opaque=0x7f793c2e0750) at util/virthreadpool.c:145
  #25 0x00007f793b09d00a in virThreadHelper (data=0x7f793c2eced0) at util/virthread.c:206
  #26 0x00007f79375c1ee5 in start_thread (arg=0x7f792a92c700) at pthread_create.c:309
  #27 0x00007f79372f0d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
As you described below, the virDomainObj is locked in frame #15 in libxlDomainAttachDeviceFlags(). In frame #12, libxl_device_nic_add() is called, which may also process pending events by calling a registered 'event_occurs' handler; in this case it's libxlDomainEventHandler() in frame #8. I suspect the event is 'shutdown', caused by one of your other threads killing off a domain. Actually, it has to be a shutdown event, since libxlDomainEventHandler() will simply free the event and return when it is not shutdown.

libxlDomainEventHandler() calls virDomainObjListFindByID() to retrieve the virDomainObj associated with the domain ID provided by libxl. Searching the domain obj list results in locking each obj before checking whether it is active and its ID equals the requested ID. And here is where we deadlock.

One solution that comes to mind is processing events in a thread using libxl_event_wait(), instead of registering an event handler with libxl. Events could then be processed asynchronously in libvirt, avoiding the deadlock. I thought about this in the past, when changing the libxl driver to use a single libxl_ctx, but was deterred by the following comment about libxl_event_wait() in libxl_event.h:

  /* Like libxl_event_check but blocks if no suitable events are
   * available, until some are.  Uses libxl_osevent_beforepoll/
   * _afterpoll so may be inefficient if very many domains are being
   * handled by a single program. */

Ian, you've become a bit familiar with libvirt's use of libxl over the years. Would you recommend using libxl_event_wait() instead of the current event handler registration approach?

BTW, I think this deadlock is a specific instance of a more general problem in the libvirt libxl driver. It seems any function that holds the virDomainObj lock and calls a libxl_* function that also processes pending events can potentially deadlock.

Regards,
Jim
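The decoupling proposed here can be sketched roughly as follows (illustrative names only; the actual fix would live in libvirt's libxl driver, in C): the callback registered with libxl only enqueues the event, and a dedicated thread dequeues and handles it later, so the domain lock is only taken when no libxl call is on the stack:

```python
import queue
import threading

events = queue.Queue()

def event_occurs(event):
    """Stand-in for the hook registered with libxl: must not take any
    domain lock; it only enqueues the event and returns immediately."""
    events.put(event)

def event_loop(handle, stop):
    """Dedicated thread: processes events outside any libxl call stack,
    so locking the domain object here cannot self-deadlock."""
    while not stop.is_set():
        try:
            ev = events.get(timeout=0.1)
        except queue.Empty:
            continue
        handle(ev)          # may lock/unlock the domain object safely
        events.task_done()

# Demonstration: a shutdown event is enqueued from "inside libxl" and
# handled asynchronously by the worker thread.
handled = []
stop = threading.Event()
worker = threading.Thread(target=event_loop, args=(handled.append, stop))
worker.start()
event_occurs({"type": "shutdown", "domid": 19})
events.join()               # wait until the worker has handled it
stop.set()
worker.join()
print(handled)              # -> [{'type': 'shutdown', 'domid': 19}]
```

The key property is that event_occurs() never blocks, so it is safe to call from a libxl callback even while an attach operation holds the domain lock.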
here I select frame 3 and look at who holds the lock:

  (gdb) frame 3
  #3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
  89          pthread_mutex_lock(&m->lock);
  $6 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312, __nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>, __align = 2}}

So I see that thread 21 waits for a lock that is held by thread 21 itself.

From frame 4 I see that the thread wants a lock for a VM object:

  #4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323

Further down in the bt I see in frame 14:

  #14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231

that VM object 0x7f79140099c0 is the object to which the network device should also be attached. And from that I know that there is a lock on the VM object.
So I go up the bt and see that libxlDomainAttachDeviceLive (libvirtd, frame 14) calls libxlDomainAttachNetDevice (libvirtd, frame 13), which calls libxl_device_nic_add (libxl/Xen, frame 12).

From "info frame" for frame 12 I see that it calls frame 0x7f792a92b5f0:

  (gdb) info f
  Stack level 12, frame at 0x7f792a92b670:
   rip = 0x7f7938048fa6 in libxl_device_nic_add (libxl.c:4226); saved rip = 0x7f78fe402baa
   called by frame at 0x7f792a92b760, caller of frame at 0x7f792a92b5f0
   source language c.
   Arglist at 0x7f792a92b5e8, args: ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>
   Locals at 0x7f792a92b5e8, Previous frame's sp is 0x7f792a92b670
   Saved registers:
    rbx at 0x7f792a92b640, rbp at 0x7f792a92b648, r12 at 0x7f792a92b650, r13 at 0x7f792a92b658, r14 at 0x7f792a92b660, rip at 0x7f792a92b668

and this is frame 11:

  #11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014

So further I see that libxl__ao_inprogress calls libxl__egc_cleanup, and that calls egc_run_callbacks (frame 9). "info frame" for frame 9 gives:

  (gdb) info f
  Stack level 9, frame at 0x7f792a92b540:
   rip = 0x7f793806f7c0 in egc_run_callbacks (libxl_event.c:1405); saved rip = 0x7f7938070f8b
   called by frame at 0x7f792a92b5f0, caller of frame at 0x7f792a92b500
   source language c.
   Arglist at 0x7f792a92b4f8, args: egc=egc@entry=0x7f792a92b580
   Locals at 0x7f792a92b4f8, Previous frame's sp is 0x7f792a92b540
   Saved registers:
    rbx at 0x7f792a92b520, rbp at 0x7f792a92b528, r12 at 0x7f792a92b530, rip at 0x7f792a92b538

and "list" gives:

  (gdb) list
  1400
  1401      LIBXL_TAILQ_FOREACH_SAFE(ev, &egc->occurred_for_callback, link, ev_tmp) {
  1402          LIBXL_TAILQ_REMOVE(&egc->occurred_for_callback, ev, link);
  1403          LOG(DEBUG,"event %p callback type=%s",
  1404              ev, libxl_event_type_to_string(ev->type));
  1405          CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);
  1406      }
  1407
  1408      LIBXL_TAILQ_FOREACH_SAFE(aop, &egc->aops_for_callback, entry, aop_tmp) {
  1409          LIBXL_TAILQ_REMOVE(&egc->aops_for_callback, aop, entry);
From here I see that egc_run_callbacks is at line 1405, "CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);", but I have no idea what it is doing there and why it "hangs" here.

If I see it correctly, the system freezes/hangs because it waits for the release of the lock on the VM that is currently attaching a network device, while the process of attaching the network device hasn't finished and still holds that lock. As I said at the beginning of my mail, if I only attach a block device to the VMs, everything runs; with the additional attaching of a network device, it freezes after 20 to 100 resumes.

I am grateful for any help, hints, and tips for further debugging.
all the best max
For completeness, here is the output of "(gdb) thread apply all bt full":
[trimmed by moderator]
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
participants (2): Eric Blake, Jim Fehlig