
On 10/13/2010 09:11 AM, Daniel P. Berrange wrote:
On Thu, Oct 07, 2010 at 09:58:28AM -0400, Stefan Berger wrote:
On 10/07/2010 09:06 AM, Soren Hansen wrote:
I had trouble applying the patch (I think maybe Thunderbird may have fiddled with the formatting :( ), but after doing it manually, it works excellently. Thanks!
Great. I will prepare a V3.
I am also shooting a kill -SIGHUP at libvirt once in a while to see what happens (while creating / destroying 2 VMs and modifying their filters). Most of the time all goes well, but occasionally things do get stuck. I get the following debugging output from libvirt and attaching gdb to libvirt I see the following stack traces. Maybe Daniel can interpret this... To me it looks like some of the conditions need to be 'tickled'...
(gdb) thr ap all bt
Thread 9 (Thread 0x7f49bf592710 (LWP 17464)):
#0  0x000000327680b729 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000435312 in virCondWaitUntil (c=<value optimized out>, m=<value optimized out>, whenms=<value optimized out>) at util/threads-pthread.c:115
#2  0x000000000043d0ab in qemuDomainObjBeginJobWithDriver (driver=0x1f9c010, obj=0x7f49a00011b0) at qemu/qemu_driver.c:409
#3  0x0000000000458abf in qemuAutostartDomain (payload=<value optimized out>, name=<value optimized out>, opaque=0x7f49bf591320) at qemu/qemu_driver.c:818
#4  0x00007f49c040ab6a in virHashForEach (table=0x1f9be20, iter=0x458a90 <qemuAutostartDomain>, data=0x7f49bf591320) at util/hash.c:495
#5  0x000000000043cdac in qemudAutostartConfigs (driver=0x1f9c010) at qemu/qemu_driver.c:855
#6  0x000000000043ce2a in qemudReload () at qemu/qemu_driver.c:2003
#7  0x00007f49c0450a3e in virStateReload () at libvirt.c:1017
#8  0x00000000004189e1 in qemudDispatchSignalEvent (watch=<value optimized out>, fd=<value optimized out>, events=<value optimized out>, opaque=0x1f6f830) at libvirtd.c:388
#9  0x00000000004186a9 in virEventDispatchHandles () at event.c:479
#10 virEventRunOnce () at event.c:608
#11 0x000000000041a346 in qemudOneLoop () at libvirtd.c:2217
#12 0x000000000041a613 in qemudRunLoop (opaque=0x1f6f830) at libvirtd.c:2326
#13 0x0000003276807761 in start_thread () from /lib64/libpthread.so.0
#14 0x00000032760e14ed in clone () from /lib64/libc.so.6
This thread shows the problem. Guests must not be run directly from the event loop thread, because startup requires waiting for I/O events. So this thread is sitting on the condition variable waiting for an I/O event to complete, but because it's doing this from the event loop thread, the event loop isn't running. So the condition will never be signalled. This is completely unrelated to the other problems discussed in this thread, and I'm surprised we've not seen it before now!
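The pattern described above can be reproduced outside libvirt with a toy event loop. This is only an illustrative sketch, not libvirt code: `EventLoop`, `start_guest`, `reload_on_loop_thread` and `reload_on_worker_thread` are hypothetical names, and a `threading.Event` with a timeout stands in for the condition variable and the timed wait seen in the backtrace.

```python
import queue
import threading


class EventLoop:
    """Toy single-threaded event loop: runs queued callbacks one at a time."""

    def __init__(self):
        self._callbacks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, callback):
        """Queue a callback for dispatch on the loop thread."""
        self._callbacks.put(callback)

    def _run(self):
        while True:
            callback = self._callbacks.get()
            callback()  # while this call blocks, nothing else is dispatched


def start_guest(loop, timeout):
    """Hypothetical guest startup that needs one I/O event from the loop.

    Returns True if the event arrived, False if the wait timed out.
    """
    io_done = threading.Event()
    loop.post(io_done.set)  # the "I/O event" can only be delivered by the loop
    return io_done.wait(timeout)


def reload_on_loop_thread(loop, timeout=0.2):
    """Run start_guest from inside a loop callback (the problematic case)."""
    result = queue.Queue()
    loop.post(lambda: result.put(start_guest(loop, timeout)))
    return result.get()


def reload_on_worker_thread(loop, timeout=5.0):
    """Run start_guest from a separate thread so the loop keeps dispatching."""
    result = queue.Queue()
    worker = threading.Thread(
        target=lambda: result.put(start_guest(loop, timeout))
    )
    worker.start()
    worker.join()
    return result.get()
```

In this sketch, `reload_on_loop_thread` only returns because the wait has a timeout: the callback that would set the event is queued behind the very callback that is waiting for it. `reload_on_worker_thread` gets its event because the loop thread stays free to dispatch it.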
Yes, it's unrelated and came up through my testing of the code paths I touched with the deadlock-prevention patch...
When you send SIGHUP to libvirt this triggers a reload of the guest domain configs. For some reason we also have this SIGHUP re-triggering autostart. IMHO this is a very big mistake. If a guest is marked as autostart, I don't think an admin would expect it to be started when just sending SIGHUP. I think we should fix it so that autostart is only ever done at daemon startup, not SIGHUP. This would avoid the entire problem code path here.
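A minimal sketch of that split, assuming a driver object with separate startup and reload entry points. The `Driver` class, its methods and the config dictionary are hypothetical and are not taken from the libvirt sources; the point is only that the reload path re-reads configs and never reaches the autostart code.

```python
class Driver:
    """Hypothetical driver that separates daemon startup from SIGHUP reload."""

    def __init__(self, configs):
        # configs maps a domain name to {"autostart": bool}; it stands in
        # for the on-disk domain configuration files.
        self._configs = configs
        self.domains = {}
        self.running = set()

    def _load_configs(self):
        # Re-read every domain definition; no domain is started here.
        self.domains = {name: dict(cfg) for name, cfg in self._configs.items()}

    def _autostart(self):
        # Only touch domains that are actually flagged as autostart.
        for name, cfg in self.domains.items():
            if cfg.get("autostart"):
                self.running.add(name)

    def startup(self):
        """Daemon startup: load configs, then autostart flagged domains."""
        self._load_configs()
        self._autostart()

    def reload(self):
        """SIGHUP path: reload configs only, never autostart."""
        self._load_configs()
```

With this shape, flagging a domain as autostart and then sending the equivalent of SIGHUP updates its stored config but leaves it stopped until the next daemon startup.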
FWIW, I don't have any VM marked as 'autostart', but the code seems to be doing something 'for' VMs no matter whether they are marked as autostart or not, i.e., it runs 'qemuDomainObjBeginJobWithDriver'.

Stefan