[...]
>
> So short story made really long, I think the best course of action will
> be to add this patch and reorder the Unref()'s (adminProgram thru srv,
> but not dmn). It seems to resolve these corner cases, but I'm also open
> to other suggestions. Still need to think about it some more too before
> posting any patches.
>
>
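For anyone skimming: the reordering described in the quote would look roughly
like the sketch below in libvirtd's cleanup path. Only the names mentioned
above (adminProgram, srv, dmn) are spelled out; the other Unref() calls are
elided, and this is an illustration of the ordering idea, not the actual patch.

    /* sketch only, not the real patch: release the program and server
     * objects before the virNetDaemon, and keep dmn's unref last */
    virObjectUnref(adminProgram);
    /* ... the other virNetServerProgram objects ... */
    virObjectUnref(srv);    /* server(s) released before the daemon */
    virObjectUnref(dmn);    /* daemon last, once its servers are gone */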
Hi.
I haven't grasped the whole picture yet, but I've managed to find out what
triggered the crash. It is not 2f3054c22, where you reordered the unrefs, but
1fd1b766105, which moves event unregistering from netserver client closing to
netserver client disposing. Before 1fd1b766105 we didn't see the crash,
because clients simply never got disposed.
Oh yeah, that one.... But considering Erik's most recent response in
this overall thread vis-a-vis the separation of "close" vs. "dispose"
and the timing of each w/r/t Unref and Free, I think having the call to
remoteClientFreePrivateCallbacks in remoteClientCloseFunc is perhaps
better than in remoteClientFreeFunc.
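In code terms (paraphrasing from memory against daemon/remote.c, so take the
exact shape with a grain of salt), that would mean doing the callback teardown
at close time, something like:

/* Rough sketch of the idea, not a verbatim diff: deregister the client's
 * event callbacks when the connection is closed, while the connection is
 * still fully usable, instead of waiting for the client object to be
 * disposed/freed on its final Unref. */
static void
remoteClientCloseFunc(virNetServerClientPtr client)
{
    struct daemonClientPrivate *priv =
        virNetServerClientGetPrivateData(client);

    daemonRemoveAllClientStreams(priv->streams);

    /* moved here from remoteClientFreeFunc() */
    remoteClientFreePrivateCallbacks(priv);
}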
As to fixing the crash with this patch, I think it's a coincidence. I want
to dispose of the netservers early in order to join the RPC threads, and it
turns out that disposing also closes the clients, and that is what fixes the
problem.
Nikolay
With Cedric's patch in place, the virt-manager client issue is fixed. So
that's goodness.
If I then add the sleep (or usleep) into qemuConnectGetAllDomainStats as
noted in what started this all, then I can either get libvirtd to crash
dereferencing a NULL driver pointer or (my favorite) hang with two
threads stuck waiting:
(gdb) t a a bt
Thread 5 (Thread 0x7fffe535b700 (LWP 15568)):
#0 0x00007ffff3dc909d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007ffff3dc1e23 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x00007ffff7299a15 in virMutexLock (m=<optimized out>)
at util/virthread.c:89
#3 0x00007fffc760621e in qemuDriverLock (driver=0x7fffbc190510)
at qemu/qemu_conf.c:100
#4 virQEMUDriverGetConfig (driver=driver@entry=0x7fffbc190510)
at qemu/qemu_conf.c:1002
#5 0x00007fffc75dfa89 in qemuDomainObjBeginJobInternal (
driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
job=job@entry=QEMU_JOB_QUERY,
asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE)
at qemu/qemu_domain.c:4690
#6 0x00007fffc75e3b2b in qemuDomainObjBeginJob (
driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
job=job@entry=QEMU_JOB_QUERY) at qemu/qemu_domain.c:4842
#7 0x00007fffc764f744 in qemuConnectGetAllDomainStats (conn=0x7fffb80009a0,
    doms=<optimized out>, ndoms=<optimized out>, stats=<optimized out>,
    retStats=0x7fffe535aaf0, flags=<optimized out>)
    at qemu/qemu_driver.c:20219
#8 0x00007ffff736430a in virDomainListGetStats (doms=0x7fffa8000950, stats=0,
    retStats=retStats@entry=0x7fffe535aaf0, flags=0)
    at libvirt-domain.c:11595
#9 0x000055555557948d in remoteDispatchConnectGetAllDomainStats (
server=<optimized out>, msg=<optimized out>, ret=0x7fffa80008e0,
args=0x7fffa80008c0, rerr=0x7fffe535abf0, client=<optimized out>)
at remote.c:6538
#10 remoteDispatchConnectGetAllDomainStatsHelper (server=<optimized out>,
client=<optimized out>, msg=<optimized out>, rerr=0x7fffe535abf0,
args=0x7fffa80008c0, ret=0x7fffa80008e0) at remote_dispatch.h:615
#11 0x00007ffff73bf59c in virNetServerProgramDispatchCall (msg=0x55555586cdd0,
    client=0x55555586bea0, server=0x55555582ed90, prog=0x555555869190)
    at rpc/virnetserverprogram.c:437
#12 virNetServerProgramDispatch (prog=0x555555869190,
server=server@entry=0x55555582ed90, client=0x55555586bea0,
msg=0x55555586cdd0) at rpc/virnetserverprogram.c:307
#13 0x00005555555a9318 in virNetServerProcessMsg (msg=<optimized out>,
prog=<optimized out>, client=<optimized out>, srv=0x55555582ed90)
at rpc/virnetserver.c:148
#14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55555582ed90)
at rpc/virnetserver.c:169
#15 0x00007ffff729a521 in virThreadPoolWorker (
opaque=opaque@entry=0x55555583aa40) at util/virthreadpool.c:167
#16 0x00007ffff7299898 in virThreadHelper (data=<optimized out>)
at util/virthread.c:206
#17 0x00007ffff3dbf36d in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff3af3e1f in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7ffff7ef9d80 (LWP 15561)):
#0 0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1 0x00007ffff7299af6 in virCondWait (c=<optimized out>, m=<optimized out>)
at util/virthread.c:154
#2 0x00007ffff729a760 in virThreadPoolFree (pool=<optimized out>)
at util/virthreadpool.c:290
#3 0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
at rpc/virnetserver.c:767
#4 0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
at util/virobject.c:356
#5 0x00007ffff724f069 in virHashFree (table=<optimized out>)
at util/virhash.c:318
#6 0x00007ffff73b8295 in virNetDaemonDispose (obj=0x55555582eb10)
at rpc/virnetdaemon.c:105
#7 0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
at util/virobject.c:356
#8 0x000055555556f2eb in main (argc=<optimized out>, argv=<optimized out>)
at libvirtd.c:1524
(gdb)
Of course, this could be a red herring, because the sleep/usleep and the
condition-handling nature of these jobs could be interfering with one
another.
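To make the shape of that hang concrete, here is a small standalone toy
(plain pthreads; none of these names are libvirt API) that ends up in the
same two-thread picture: a worker sleeps for a bit (standing in for the
injected usleep in qemuConnectGetAllDomainStats) and then tries to take the
"driver" lock, while the main thread has already grabbed that lock on its
teardown path and is waiting for the worker to finish, i.e. the moral
equivalent of virThreadPoolFree() waiting on a job stuck in qemuDriverLock():

/* toy_shutdown_deadlock.c -- standalone illustration, NOT libvirt code.
 * Build: cc -o toy toy_shutdown_deadlock.c -lpthread
 * The program hangs by design, mirroring the backtrace above:
 *   worker thread: blocked in pthread_mutex_lock()  ("qemuDriverLock")
 *   main thread:   blocked in pthread_join()        ("virThreadPoolFree") */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *opaque)
{
    (void)opaque;
    sleep(1);                         /* the injected sleep in the stats job */
    pthread_mutex_lock(&driver_lock); /* never returns: lock is held below */
    pthread_mutex_unlock(&driver_lock);
    return NULL;
}

int
main(void)
{
    pthread_t th;
    pthread_create(&th, NULL, worker, NULL);

    /* Teardown starts while the RPC job is still in flight: the state
     * driver's lock is taken (or the driver is already being freed)... */
    pthread_mutex_lock(&driver_lock);

    /* ...and only then do we drain the worker pool.  The worker can never
     * finish, so this join blocks forever. */
    fprintf(stderr, "joining worker (this will hang)\n");
    pthread_join(th, NULL);

    pthread_mutex_unlock(&driver_lock);
    return 0;
}

The real interaction is obviously messier (in the crash variant the driver
pointer is already NULL/freed rather than merely locked), but the ordering
problem is the same: RPC workers are still being drained after the state they
need has started to be torn down.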
Still, adding the "virHashRemoveAll(dmn->servers);" call into
virNetDaemonClose doesn't help the situation, as I can still either crash
randomly or hang, so I'm less convinced this would really fix anything.
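For reference, the experiment was roughly the following (paraphrased, not the
exact hunk, and the per-server close iterator name is from memory):

/* Sketch of the experiment, not a real patch: make virNetDaemonClose()
 * also drop the daemon's hash-table references on its servers, so each
 * virNetServer can be disposed (and its worker pool joined) before the
 * final virObjectUnref(dmn) in main(). */
void
virNetDaemonClose(virNetDaemonPtr dmn)
{
    if (!dmn)
        return;

    virObjectLock(dmn);

    /* existing behaviour: ask every server to close its services/clients */
    virHashForEach(dmn->servers, daemonServerClose, NULL);

    virHashRemoveAll(dmn->servers);   /* the addition being tested */

    virObjectUnlock(dmn);
}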
It does change the "nature" of the hung thread stack trace though, as
the second thread is now:
Thread 1 (Thread 0x7ffff7ef9d80 (LWP 20159)):
#0 0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1 0x00007ffff7299b06 in virCondWait (c=<optimized out>, m=<optimized out>)
at util/virthread.c:154
#2 0x00007ffff729a770 in virThreadPoolFree (pool=<optimized out>)
at util/virthreadpool.c:290
#3 0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
at rpc/virnetserver.c:767
#4 0x00007ffff727924b in virObjectUnref (anyobj=<optimized out>)
at util/virobject.c:356
#5 0x000055555556f2e3 in main (argc=<optimized out>, argv=<optimized out>)
at libvirtd.c:1523
(gdb)
So we still haven't found the "root cause", but I think Erik is on to
something in the other part of this thread. I'll go there.
John