Re: [libvirt] [PATCH] [RFC] nwfilter: resolve deadlock between VM operations and filter update

13 Oct 2010


On 10/13/2010 09:11 AM, Daniel P. Berrange wrote:
...
On Thu, Oct 07, 2010 at 09:58:28AM -0400, Stefan Berger wrote:
...
On 10/07/2010 09:06 AM, Soren Hansen wrote:
...
I had trouble applying the patch (I think Thunderbird may have
fiddled with the formatting :( ), but after doing it manually, it works
excellently. Thanks!
Great. I will prepare a V3.
I am also shooting a kill -SIGHUP at libvirt once in a while to see what
happens (while creating / destroying 2 VMs and modifying their filters).
Most of the time all goes well, but occasionally things do get stuck. I
get the following debugging output from libvirt, and on attaching gdb to
libvirt I see the following stack traces. Maybe Daniel can interpret
this... To me it looks like some of the conditions need to be 'tickled'...
(gdb) thr ap all bt
Thread 9 (Thread 0x7f49bf592710 (LWP 17464)):
#0  0x000000327680b729 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    from /lib64/libpthread.so.0
#1  0x0000000000435312 in virCondWaitUntil (c=<value optimized out>,
     m=<value optimized out>, whenms=<value optimized out>)
     at util/threads-pthread.c:115
#2  0x000000000043d0ab in qemuDomainObjBeginJobWithDriver (driver=0x1f9c010,
     obj=0x7f49a00011b0) at qemu/qemu_driver.c:409
#3  0x0000000000458abf in qemuAutostartDomain (payload=<value optimized out>,
     name=<value optimized out>, opaque=0x7f49bf591320)
     at qemu/qemu_driver.c:818
#4  0x00007f49c040ab6a in virHashForEach (table=0x1f9be20,
     iter=0x458a90 <qemuAutostartDomain>, data=0x7f49bf591320)
     at util/hash.c:495
#5  0x000000000043cdac in qemudAutostartConfigs (driver=0x1f9c010)
     at qemu/qemu_driver.c:855
#6  0x000000000043ce2a in qemudReload () at qemu/qemu_driver.c:2003
#7  0x00007f49c0450a3e in virStateReload () at libvirt.c:1017
#8  0x00000000004189e1 in qemudDispatchSignalEvent (
     watch=<value optimized out>, fd=<value optimized out>,
     events=<value optimized out>, opaque=0x1f6f830) at libvirtd.c:388
#9  0x00000000004186a9 in virEventDispatchHandles () at event.c:479
#10 virEventRunOnce () at event.c:608
#11 0x000000000041a346 in qemudOneLoop () at libvirtd.c:2217
#12 0x000000000041a613 in qemudRunLoop (opaque=0x1f6f830) at libvirtd.c:2326
#13 0x0000003276807761 in start_thread () from /lib64/libpthread.so.0
#14 0x00000032760e14ed in clone () from /lib64/libc.so.6
This thread shows the problem. Guests must not be run directly
from the event loop thread, because startup requires waiting
for I/O events. So this thread is sitting on the condition
variable waiting for an I/O event to complete, but because
it's doing this from the event loop thread, the event loop
isn't running. So the condition will never be signalled.
This is completely unrelated to the other problems discussed
in this thread, and I'm surprised we've not seen it before now!
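
The failure mode described above can be sketched in a few lines of pthreads
code. This is only an illustration under stated assumptions, not libvirt's
actual code: `struct job`, `wait_for_io`, `dispatch_io_event` and the two
`start_guest_*` functions are hypothetical names, and the "event loop" is
reduced to a single call that sets a flag and signals the condition. The
first entry point waits on the loop thread itself, so the signal can only
arrive after the wait has given up; the second hands the wait to a worker
thread, leaving the loop thread free to dispatch the event.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a per-domain job; not a libvirt type. */
struct job {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool io_done;               /* set once the "I/O event" was dispatched */
};

static void job_init(struct job *j)
{
    int rc = pthread_mutex_init(&j->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&j->cond, NULL);
    assert(rc == 0);
    (void)rc;
    j->io_done = false;
}

static void job_destroy(struct job *j)
{
    pthread_cond_destroy(&j->cond);
    pthread_mutex_destroy(&j->lock);
}

/* Wait until io_done is set or timeout_ms elapses: returns 0 or ETIMEDOUT. */
static int wait_for_io(struct job *j, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&j->lock);
    while (!j->io_done && rc == 0)
        rc = pthread_cond_timedwait(&j->cond, &j->lock, &deadline);
    if (j->io_done)
        rc = 0;
    pthread_mutex_unlock(&j->lock);
    return rc;
}

/* What the event loop would do when it gets to run an iteration. */
static void dispatch_io_event(struct job *j)
{
    pthread_mutex_lock(&j->lock);
    j->io_done = true;
    pthread_cond_signal(&j->cond);
    pthread_mutex_unlock(&j->lock);
}

/* Broken pattern: the loop thread blocks, so nobody can dispatch the event. */
int start_guest_on_loop_thread(long timeout_ms)
{
    struct job j;
    job_init(&j);
    int rc = wait_for_io(&j, timeout_ms);   /* the loop is stalled here */
    dispatch_io_event(&j);                  /* runs only after the wait ended */
    job_destroy(&j);
    return rc;
}

struct worker_args {
    struct job *job;
    long timeout_ms;
    int rc;
};

static void *worker(void *opaque)
{
    struct worker_args *a = opaque;
    a->rc = wait_for_io(a->job, a->timeout_ms);
    return NULL;
}

/* Safer pattern: a worker thread waits while the loop thread keeps running. */
int start_guest_on_worker_thread(long timeout_ms)
{
    struct job j;
    job_init(&j);
    struct worker_args a = { &j, timeout_ms, -1 };
    pthread_t t;
    if (pthread_create(&t, NULL, worker, &a) != 0) {
        job_destroy(&j);
        return -1;
    }
    dispatch_io_event(&j);                  /* loop thread is free to dispatch */
    pthread_join(t, NULL);
    job_destroy(&j);
    return a.rc;
}
```

In this sketch the wait has a timeout, so the broken variant fails with
ETIMEDOUT instead of hanging; with an untimed wait it would block for good.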
Yes, it's unrelated and came up through my testing of the code paths I 
touched with the deadlock-prevention patch...
...
When you send SIGHUP to libvirt this triggers a reload of the
guest domain configs. For some reason we also have this SIGHUP
re-triggering autostart. IMHO this is a very big mistake. If
a guest is marked as autostart, I don't think an admin would
expect it to be started when just sending SIGHUP. I think we
should fix it so that autostart is only ever done at daemon
startup, not SIGHUP. This would avoid the entire problem code
path here.
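
As a rough sketch of that proposal (not the actual qemu driver code), the
startup and reload paths could share the config loading while only startup
walks the domains for autostart. Everything here is hypothetical: `struct
domain`, `struct driver`, `driver_startup` and `driver_reload` are made-up
names, and `jobs_begun` merely counts where a real driver would begin a job
and start the guest.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified domain record; not libvirt's domain object. */
struct domain {
    const char *name;
    bool autostart;
    bool active;
    int jobs_begun;     /* how often a job was begun on this domain */
};

/* Hypothetical driver state. */
struct driver {
    struct domain *doms;
    size_t ndoms;
    int config_loads;   /* how often the configs were (re)read */
};

/* Stand-in for re-reading the domain configs from disk. */
static void load_configs(struct driver *drv)
{
    drv->config_loads++;
}

/* Start one domain if it is flagged autostart and not running yet.
 * The flag is checked first, so unflagged domains never enter a job. */
static void autostart_domain(struct domain *d)
{
    if (!d->autostart || d->active)
        return;
    d->jobs_begun++;    /* stands in for beginning a job and starting the guest */
    d->active = true;
}

/* Daemon startup: load configs, then autostart the flagged domains. */
void driver_startup(struct driver *drv)
{
    load_configs(drv);
    for (size_t i = 0; i < drv->ndoms; i++)
        autostart_domain(&drv->doms[i]);
}

/* SIGHUP path: re-read configs only, never trigger autostart. */
void driver_reload(struct driver *drv)
{
    load_configs(drv);
}
```

With this split, a guest that an admin shut down on purpose stays down
across a SIGHUP, and the reload path never has to wait for guest startup.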
FWIW, I don't have any VM marked as 'autostart', but the code seems to 
be doing something 'for' VMs no matter whether they are marked as 
autostart or not, i.e., it runs 'qemuDomainObjBeginJobWithDriver'.

    Stefan