[libvirt] [PATCH v2 0/3] libvirtd: fix crashes on termination

Diff from v1
============

1. Drop patches 1-2 of v1.

   As to the "event loop" patch: nobody seems to have "production" problems due to leaks on termination, and enabling the freeing of callback objects registered in the event loop after it has finished has an impact that is hard to predict. Different parts of libvirt use the event loop, and the patch could trigger paths that were never exercised before. Let's wait for a real issue.

   The "breaking cyclic dependency" patch will not reach its target without the "event loop" patch anyway - the daemon object will still leak in certain situations - so let's drop it too.

2. Add a code-commenting patch.

   It documents another reason why we should not free callback objects synchronously in the remove handle/timeout functions, besides Dan's objection:
   https://www.redhat.com/archives/libvir-list/2016-September/msg01005.html

Reproducing
===========

The crash addressed by patches 1-2 can easily be simulated; just patch libvirt:

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 30a2830..f6b71d6 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -4515,6 +4515,8 @@ processMonitorEOFEvent(virQEMUDriverPtr driver,
     unsigned int stopFlags = 0;
     virObjectEventPtr event = NULL;
 
+    sleep(3);
+
     if (qemuProcessBeginStopJob(driver, vm, QEMU_JOB_DESTROY, true) < 0)
         return;

Then, when it is up and running and there is also a qemu domain running:

    kill -9 $QEMU_DOMAIN && pkill libvirtd

Note that there should be no admin connection open at that moment, or the crash will not happen, due to the leaks.

Nikolay Shirokovskiy (3):
  daemon: keep daemon until all hypervisors drivers are cleaned up
  qemu: first wait all workers finish on state cleanup
  util: event loop: document another reason to defer deletion

 daemon/libvirtd.c       | 4 +++-
 src/qemu/qemu_driver.c  | 2 +-
 src/util/vireventpoll.c | 7 +++++++
 3 files changed, 11 insertions(+), 2 deletions(-)

-- 
1.8.3.1

This fixes a SEGV with the backtrace [1].

It happens due to a race between libvirtd finishing and a qemu process exiting at the same time. We need to keep the daemon object alive until all hypervisor drivers are cleaned up. The other option is to take a reference to the daemon in every hypervisor driver, but that would take more work and we really don't need it: the driver cleanup procedure is synchronous, so let's just drop the last reference to the daemon after driver cleanup.

[1] backtrace (shortened a bit)

 1  0x00007fd3a791b2e6 in virCondWait (c=<optimized out>, m=<optimized out>) at util/virthread.c:154
 2  0x00007fd3a791bcb0 in virThreadPoolFree (pool=0x7fd38024ee00) at util/virthreadpool.c:266
 3  0x00007fd38edaa00e in qemuStateCleanup () at qemu/qemu_driver.c:1116
 4  0x00007fd3a79abfeb in virStateCleanup () at libvirt.c:808
 5  0x00007fd3a85f2c9e in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:1660

Thread 1 (Thread 0x7fd38722d700 (LWP 32256)):
 0  0x00007fd3a7900910 in virClassIsDerivedFrom (klass=0xdfd36058d4853, parent=0x7fd3a8f394d0) at util/virobject.c:169
 1  0x00007fd3a7900c4e in virObjectIsClass (anyobj=anyobj@entry=0x7fd3a8f2f850, klass=<optimized out>) at util/virobject.c:365
 2  0x00007fd3a7900c74 in virObjectLock (anyobj=0x7fd3a8f2f850) at util/virobject.c:317
 3  0x00007fd3a7a24d5d in virNetDaemonRemoveShutdownInhibition (dmn=0x7fd3a8f2f850) at rpc/virnetdaemon.c:547
 4  0x00007fd38ed722cf in qemuProcessStop (driver=driver@entry=0x7fd380103810, vm=vm@entry=0x7fd38025b6d0, reason=reason@entry=VIR_DOMAIN_SHUTOFF_SHUTDOWN, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE, flags=flags@entry=0) at qemu/qemu_process.c:5786
 5  0x00007fd38edd9428 in processMonitorEOFEvent (vm=0x7fd38025b6d0, driver=0x7fd380103810) at qemu/qemu_driver.c:4588
 6  qemuProcessEventHandler (data=<optimized out>, opaque=0x7fd380103810) at qemu/qemu_driver.c:4632
 7  0x00007fd3a791bb55 in virThreadPoolWorker (opaque=opaque@entry=0x7fd3a8f1e4c0) at util/virthreadpool.c:145

---
 daemon/libvirtd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 95c1b1c..eefdefc 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1628,7 +1628,6 @@ int main(int argc, char **argv) {
     virObjectUnref(qemuProgram);
     virObjectUnref(adminProgram);
     virNetDaemonClose(dmn);
-    virObjectUnref(dmn);
     virObjectUnref(srv);
     virObjectUnref(srvAdm);
     virNetlinkShutdown();
@@ -1658,6 +1657,9 @@ int main(int argc, char **argv) {
         driversInitialized = false;
         virStateCleanup();
     }
+    /* unref daemon only here as hypervisor drivers can
+       call shutdown inhibition functions on cleanup */
+    virObjectUnref(dmn);
 
     return ret;
 }
-- 
1.8.3.1

On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This fixes a SEGV with the backtrace [1].
It happens due to a race between libvirtd finishing and a qemu process exiting at the same time. We need to keep the daemon object alive until all hypervisor drivers are cleaned up. The other option is to take a reference to the daemon in every hypervisor driver, but that would take more work and we really don't need it: the driver cleanup procedure is synchronous, so let's just drop the last reference to the daemon after driver cleanup.
[1] backtrace (shortened a bit)
 1  0x00007fd3a791b2e6 in virCondWait (c=<optimized out>, m=<optimized out>) at util/virthread.c:154
 2  0x00007fd3a791bcb0 in virThreadPoolFree (pool=0x7fd38024ee00) at util/virthreadpool.c:266
 3  0x00007fd38edaa00e in qemuStateCleanup () at qemu/qemu_driver.c:1116
 4  0x00007fd3a79abfeb in virStateCleanup () at libvirt.c:808
 5  0x00007fd3a85f2c9e in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:1660

Thread 1 (Thread 0x7fd38722d700 (LWP 32256)):
 0  0x00007fd3a7900910 in virClassIsDerivedFrom (klass=0xdfd36058d4853, parent=0x7fd3a8f394d0) at util/virobject.c:169
 1  0x00007fd3a7900c4e in virObjectIsClass (anyobj=anyobj@entry=0x7fd3a8f2f850, klass=<optimized out>) at util/virobject.c:365
 2  0x00007fd3a7900c74 in virObjectLock (anyobj=0x7fd3a8f2f850) at util/virobject.c:317
 3  0x00007fd3a7a24d5d in virNetDaemonRemoveShutdownInhibition (dmn=0x7fd3a8f2f850) at rpc/virnetdaemon.c:547
 4  0x00007fd38ed722cf in qemuProcessStop (driver=driver@entry=0x7fd380103810, vm=vm@entry=0x7fd38025b6d0, reason=reason@entry=VIR_DOMAIN_SHUTOFF_SHUTDOWN, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE, flags=flags@entry=0) at qemu/qemu_process.c:5786
 5  0x00007fd38edd9428 in processMonitorEOFEvent (vm=0x7fd38025b6d0, driver=0x7fd380103810) at qemu/qemu_driver.c:4588
 6  qemuProcessEventHandler (data=<optimized out>, opaque=0x7fd380103810) at qemu/qemu_driver.c:4632
 7  0x00007fd3a791bb55 in virThreadPoolWorker (opaque=opaque@entry=0x7fd3a8f1e4c0) at util/virthreadpool.c:145

---
 daemon/libvirtd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
I'd like to adjust the commit message just a bit... in particular removing the "The other option..." sentence.

So the "problem" discovered is that virStateInitialize requires/uses the "dmn" reference as the argument to daemonInhibitCallback; by unref'ing the object too early and then calling virStateCleanup, which calls daemonInhibitCallback using dmn, we end up with the SEGV.
diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 95c1b1c..eefdefc 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1628,7 +1628,6 @@ int main(int argc, char **argv) {
     virObjectUnref(qemuProgram);
     virObjectUnref(adminProgram);
     virNetDaemonClose(dmn);
-    virObjectUnref(dmn);
     virObjectUnref(srv);
     virObjectUnref(srvAdm);
     virNetlinkShutdown();
@@ -1658,6 +1657,9 @@ int main(int argc, char **argv) {
         driversInitialized = false;
         virStateCleanup();
     }
Another option would have been to move the above hunk up... Since setting daemonStateInit is one of the last things done before running the event loop, it stands to reason that shutting down should be done in the opposite order, thus causing those InhibitCallbacks to be run perhaps after virNetlinkEventServiceStopAll and before virNetDaemonClose.

Any thoughts on that?

This I assume works, but do we run into "other" issues because we're running those callbacks after virNetDaemonClose and virNetlinkShutdown?

Of course the "fear" of moving is that it causes "other" timing issues with previous adjustments here, in particular commit ids '5c756e580' and 'a602e90bc1', which introduced and adjusted driversInitialized.

I'm OK with leaving things as is, but perhaps someone else will see this and comment as well.

John
+    /* unref daemon only here as hypervisor drivers can
+       call shutdown inhibition functions on cleanup */
+    virObjectUnref(dmn);
 
     return ret;
 }

On 27.10.2016 15:59, John Ferlan wrote:
On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This fixes a SEGV with the backtrace [1].
It happens due to a race between libvirtd finishing and a qemu process exiting at the same time. We need to keep the daemon object alive until all hypervisor drivers are cleaned up. The other option is to take a reference to the daemon in every hypervisor driver, but that would take more work and we really don't need it: the driver cleanup procedure is synchronous, so let's just drop the last reference to the daemon after driver cleanup.
[1] backtrace (shortened a bit)
 1  0x00007fd3a791b2e6 in virCondWait (c=<optimized out>, m=<optimized out>) at util/virthread.c:154
 2  0x00007fd3a791bcb0 in virThreadPoolFree (pool=0x7fd38024ee00) at util/virthreadpool.c:266
 3  0x00007fd38edaa00e in qemuStateCleanup () at qemu/qemu_driver.c:1116
 4  0x00007fd3a79abfeb in virStateCleanup () at libvirt.c:808
 5  0x00007fd3a85f2c9e in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:1660

Thread 1 (Thread 0x7fd38722d700 (LWP 32256)):
 0  0x00007fd3a7900910 in virClassIsDerivedFrom (klass=0xdfd36058d4853, parent=0x7fd3a8f394d0) at util/virobject.c:169
 1  0x00007fd3a7900c4e in virObjectIsClass (anyobj=anyobj@entry=0x7fd3a8f2f850, klass=<optimized out>) at util/virobject.c:365
 2  0x00007fd3a7900c74 in virObjectLock (anyobj=0x7fd3a8f2f850) at util/virobject.c:317
 3  0x00007fd3a7a24d5d in virNetDaemonRemoveShutdownInhibition (dmn=0x7fd3a8f2f850) at rpc/virnetdaemon.c:547
 4  0x00007fd38ed722cf in qemuProcessStop (driver=driver@entry=0x7fd380103810, vm=vm@entry=0x7fd38025b6d0, reason=reason@entry=VIR_DOMAIN_SHUTOFF_SHUTDOWN, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE, flags=flags@entry=0) at qemu/qemu_process.c:5786
 5  0x00007fd38edd9428 in processMonitorEOFEvent (vm=0x7fd38025b6d0, driver=0x7fd380103810) at qemu/qemu_driver.c:4588
 6  qemuProcessEventHandler (data=<optimized out>, opaque=0x7fd380103810) at qemu/qemu_driver.c:4632
 7  0x00007fd3a791bb55 in virThreadPoolWorker (opaque=opaque@entry=0x7fd3a8f1e4c0) at util/virthreadpool.c:145

---
 daemon/libvirtd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
I'd like to adjust the commit message just a bit... In particular removing the "The other option..." sentence.
ok
So the "problem" discovered is that virStateInitialize requires/uses the "dmn" reference as the argument to daemonInhibitCallback; by unref'ing the object too early and then calling virStateCleanup, which calls daemonInhibitCallback using dmn, we end up with the SEGV.
yep
diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 95c1b1c..eefdefc 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1628,7 +1628,6 @@ int main(int argc, char **argv) {
     virObjectUnref(qemuProgram);
     virObjectUnref(adminProgram);
     virNetDaemonClose(dmn);
-    virObjectUnref(dmn);
     virObjectUnref(srv);
     virObjectUnref(srvAdm);
     virNetlinkShutdown();
@@ -1658,6 +1657,9 @@ int main(int argc, char **argv) {
         driversInitialized = false;
         virStateCleanup();
     }
Another option would have been to move the above hunk up... Since setting the daemonStateInit is one of the last things done before running the event loop, it stands to reason shutting down should be done in the opposite order thus causing those InhibitCallbacks to be run perhaps after virNetlinkEventServiceStopAll and before virNetDaemonClose.
By the above hunk you mean virStateCleanup() et al.? If yes - then I don't think so. The thing is that startup is tricky.

1. We start the network part but keep it in a disabled state. In this state it creates sockets and listens on them but does not accept connections.

2. We initialize all the hypervisor drivers and then enable the network part to let it accept connections.

Thus the effective order is the reverse of what it seems: first the hypervisor drivers, then the network servers. Actually there is a direct reason why the hypervisor drivers are shut down after the network part - otherwise requests could be delivered to hypervisor drivers in an invalid state.

Nikolay
Any thoughts on that?
This I assume works, but do we run into "other" issues because we're running those callbacks after virNetDaemonClose and virNetlinkShutdown?
Of course the "fear" of moving is that it causes "other" timing issues with previous adjustments here, in particular:
commit ids '5c756e580' and 'a602e90bc1', which introduced and adjusted driversInitialized
I'm OK with leaving things as is, but perhaps someone else will see this and comment as well.
John
+    /* unref daemon only here as hypervisor drivers can
+       call shutdown inhibition functions on cleanup */
+    virObjectUnref(dmn);
 
     return ret;
 }

On 10/27/2016 10:04 AM, Nikolay Shirokovskiy wrote:
On 27.10.2016 15:59, John Ferlan wrote:
On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This fixes a SEGV with the backtrace [1].
It happens due to a race between libvirtd finishing and a qemu process exiting at the same time. We need to keep the daemon object alive until all hypervisor drivers are cleaned up. The other option is to take a reference to the daemon in every hypervisor driver, but that would take more work and we really don't need it: the driver cleanup procedure is synchronous, so let's just drop the last reference to the daemon after driver cleanup.
[1] backtrace (shortened a bit)
 1  0x00007fd3a791b2e6 in virCondWait (c=<optimized out>, m=<optimized out>) at util/virthread.c:154
 2  0x00007fd3a791bcb0 in virThreadPoolFree (pool=0x7fd38024ee00) at util/virthreadpool.c:266
 3  0x00007fd38edaa00e in qemuStateCleanup () at qemu/qemu_driver.c:1116
 4  0x00007fd3a79abfeb in virStateCleanup () at libvirt.c:808
 5  0x00007fd3a85f2c9e in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:1660

Thread 1 (Thread 0x7fd38722d700 (LWP 32256)):
 0  0x00007fd3a7900910 in virClassIsDerivedFrom (klass=0xdfd36058d4853, parent=0x7fd3a8f394d0) at util/virobject.c:169
 1  0x00007fd3a7900c4e in virObjectIsClass (anyobj=anyobj@entry=0x7fd3a8f2f850, klass=<optimized out>) at util/virobject.c:365
 2  0x00007fd3a7900c74 in virObjectLock (anyobj=0x7fd3a8f2f850) at util/virobject.c:317
 3  0x00007fd3a7a24d5d in virNetDaemonRemoveShutdownInhibition (dmn=0x7fd3a8f2f850) at rpc/virnetdaemon.c:547
 4  0x00007fd38ed722cf in qemuProcessStop (driver=driver@entry=0x7fd380103810, vm=vm@entry=0x7fd38025b6d0, reason=reason@entry=VIR_DOMAIN_SHUTOFF_SHUTDOWN, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE, flags=flags@entry=0) at qemu/qemu_process.c:5786
 5  0x00007fd38edd9428 in processMonitorEOFEvent (vm=0x7fd38025b6d0, driver=0x7fd380103810) at qemu/qemu_driver.c:4588
 6  qemuProcessEventHandler (data=<optimized out>, opaque=0x7fd380103810) at qemu/qemu_driver.c:4632
 7  0x00007fd3a791bb55 in virThreadPoolWorker (opaque=opaque@entry=0x7fd3a8f1e4c0) at util/virthreadpool.c:145

---
 daemon/libvirtd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
I'd like to adjust the commit message just a bit... In particular removing the "The other option..." sentence.
ok
So the "problem" discovered is that virStateInitialize requires/uses the "dmn" reference as the argument to the daemonInhibitCallback and by dereferencing the object too early and then calling the virStateCleanup which calls the daemonInhibitCallback using the dmn we end up with the SEGV.
yep
diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 95c1b1c..eefdefc 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1628,7 +1628,6 @@ int main(int argc, char **argv) {
     virObjectUnref(qemuProgram);
     virObjectUnref(adminProgram);
     virNetDaemonClose(dmn);
-    virObjectUnref(dmn);
     virObjectUnref(srv);
     virObjectUnref(srvAdm);
     virNetlinkShutdown();
@@ -1658,6 +1657,9 @@ int main(int argc, char **argv) {
         driversInitialized = false;
         virStateCleanup();
     }
Another option would have been to move the above hunk up... Since setting the daemonStateInit is one of the last things done before running the event loop, it stands to reason shutting down should be done in the opposite order thus causing those InhibitCallbacks to be run perhaps after virNetlinkEventServiceStopAll and before virNetDaemonClose.
By the above hunk you mean virStateCleanup() et al.? If yes - then I don't think so. The thing is that startup is tricky.
1. We start the network part but keep it in a disabled state. In this state it creates sockets and listens on them but does not accept connections.
2. We initialize all the hypervisor drivers and then enable the network part to let it accept connections.
Thus the effective order is the reverse of what it seems: first the hypervisor drivers, then the network servers.
Actually there is a direct reason why the hypervisor drivers are shut down after the network part - otherwise requests could be delivered to hypervisor drivers in an invalid state.
Yeah - it's an intricate and delicate balance, especially that inhibit callback stuff. Since you only have one domain and one libvirtd, I'll keep the order as is... Tks - John
Nikolay
Any thoughts on that?
This I assume works, but do we run into "other" issues because we're running those callbacks after virNetDaemonClose and virNetlinkShutdown?
Of course the "fear" of moving is that it causes "other" timing issues with previous adjustments here, in particular:
commit ids '5c756e580' and 'a602e90bc1', which introduced and adjusted driversInitialized
I'm OK with leaving things as is, but perhaps someone else will see this and comment as well.
John
+    /* unref daemon only here as hypervisor drivers can
+       call shutdown inhibition functions on cleanup */
+    virObjectUnref(dmn);
 
     return ret;
 }

This fixes the crash [1].

It happens due to a race between libvirtd exiting and a qemu process exiting at the same time. Let's just wait for all workers that keep a driver pointer to finish before disposing of the driver object or any of its members.

[1] crash backtrace (shortened a bit):

 0  0x00007ffff7282f2b in virClassIsDerivedFrom (klass=0xdeadbeef, parent=0x55555581d650) at util/virobject.c:169
 1  0x00007ffff72835fd in virObjectIsClass (anyobj=0x7fffd024f580, klass=0x55555581d650) at util/virobject.c:365
 2  0x00007ffff7283498 in virObjectLock (anyobj=0x7fffd024f580) at util/virobject.c:317
 3  0x00007ffff722f0a3 in virCloseCallbacksUnset (closeCallbacks=0x7fffd024f580, vm=0x7fffd0194db0, cb=0x7fffdf1af765 <qemuProcessAutoDestroy>) at util/virclosecallbacks.c:164
 4  0x00007fffdf1afa7b in qemuProcessAutoDestroyRemove (driver=0x7fffd00f3a60, vm=0x7fffd0194db0) at qemu/qemu_process.c:6365
 5  0x00007fffdf1adff1 in qemuProcessStop (driver=0x7fffd00f3a60, vm=0x7fffd0194db0, reason=VIR_DOMAIN_SHUTOFF_CRASHED, asyncJob=QEMU_ASYNC_JOB_NONE, flags=0) at qemu/qemu_process.c:5877
 6  0x00007fffdf1f711c in processMonitorEOFEvent (driver=0x7fffd00f3a60, vm=0x7fffd0194db0) at qemu/qemu_driver.c:4545
 7  0x00007fffdf1f7313 in qemuProcessEventHandler (data=0x555555832710, opaque=0x7fffd00f3a60) at qemu/qemu_driver.c:4589
 8  0x00007ffff72a84c4 in virThreadPoolWorker (opaque=0x555555805da0) at util/virthreadpool.c:167

Thread 1 (Thread 0x7ffff7fb1880 (LWP 494472)):
 1  0x00007ffff72a7898 in virCondWait (c=0x7fffd01c21f8, m=0x7fffd01c21a0) at util/virthread.c:154
 2  0x00007ffff72a8a22 in virThreadPoolFree (pool=0x7fffd01c2160) at util/virthreadpool.c:290
 3  0x00007fffdf1edd44 in qemuStateCleanup () at qemu/qemu_driver.c:1102
 4  0x00007ffff736570a in virStateCleanup () at libvirt.c:807
 5  0x000055555556f991 in main (argc=1, argv=0x7fffffffe458) at libvirtd.c:1660

---
 src/qemu/qemu_driver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 12ddbc0..58753ea 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1068,6 +1068,7 @@ qemuStateCleanup(void)
     if (!qemu_driver)
         return -1;
 
+    virThreadPoolFree(qemu_driver->workerPool);
     virNWFilterUnRegisterCallbackDriver(&qemuCallbackDriver);
     virObjectUnref(qemu_driver->config);
     virObjectUnref(qemu_driver->hostdevMgr);
@@ -1099,7 +1100,6 @@ qemuStateCleanup(void)
     virLockManagerPluginUnref(qemu_driver->lockManager);
 
     virMutexDestroy(&qemu_driver->lock);
-    virThreadPoolFree(qemu_driver->workerPool);
     VIR_FREE(qemu_driver);
 
     return 0;
-- 
1.8.3.1

On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This fixes the crash [1].
It happens due to a race between libvirtd exiting and a qemu process exiting at the same time. Let's just wait for all workers that keep a driver pointer to finish before disposing of the driver object or any of its members.
[1] crash backtrace: (bt is shortened a bit):
Before pushing I'll reword the above just a little bit for readability
 0  0x00007ffff7282f2b in virClassIsDerivedFrom (klass=0xdeadbeef, parent=0x55555581d650) at util/virobject.c:169
 1  0x00007ffff72835fd in virObjectIsClass (anyobj=0x7fffd024f580, klass=0x55555581d650) at util/virobject.c:365
 2  0x00007ffff7283498 in virObjectLock (anyobj=0x7fffd024f580) at util/virobject.c:317
 3  0x00007ffff722f0a3 in virCloseCallbacksUnset (closeCallbacks=0x7fffd024f580, vm=0x7fffd0194db0, cb=0x7fffdf1af765 <qemuProcessAutoDestroy>) at util/virclosecallbacks.c:164
 4  0x00007fffdf1afa7b in qemuProcessAutoDestroyRemove (driver=0x7fffd00f3a60, vm=0x7fffd0194db0) at qemu/qemu_process.c:6365
 5  0x00007fffdf1adff1 in qemuProcessStop (driver=0x7fffd00f3a60, vm=0x7fffd0194db0, reason=VIR_DOMAIN_SHUTOFF_CRASHED, asyncJob=QEMU_ASYNC_JOB_NONE, flags=0) at qemu/qemu_process.c:5877
 6  0x00007fffdf1f711c in processMonitorEOFEvent (driver=0x7fffd00f3a60, vm=0x7fffd0194db0) at qemu/qemu_driver.c:4545
 7  0x00007fffdf1f7313 in qemuProcessEventHandler (data=0x555555832710, opaque=0x7fffd00f3a60) at qemu/qemu_driver.c:4589
 8  0x00007ffff72a84c4 in virThreadPoolWorker (opaque=0x555555805da0) at util/virthreadpool.c:167

Thread 1 (Thread 0x7ffff7fb1880 (LWP 494472)):
 1  0x00007ffff72a7898 in virCondWait (c=0x7fffd01c21f8, m=0x7fffd01c21a0) at util/virthread.c:154
 2  0x00007ffff72a8a22 in virThreadPoolFree (pool=0x7fffd01c2160) at util/virthreadpool.c:290
 3  0x00007fffdf1edd44 in qemuStateCleanup () at qemu/qemu_driver.c:1102
 4  0x00007ffff736570a in virStateCleanup () at libvirt.c:807
 5  0x000055555556f991 in main (argc=1, argv=0x7fffffffe458) at libvirtd.c:1660

---
 src/qemu/qemu_driver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 12ddbc0..58753ea 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1068,6 +1068,7 @@ qemuStateCleanup(void)
     if (!qemu_driver)
         return -1;
+ virThreadPoolFree(qemu_driver->workerPool);
For this one - your thoughts on swapping the order to be NWFilterUnRegister and then ThreadPoolFree(workerPool)? Nothing in the NW filter callback jumps out at me as generating a workerPool job, but since the order of initialization is workerPool, then NWFilterCallbackRegister, I figure the cleanup order should just be inverted.

ACK for what's here, but I'll await yours or any other feedback on swapping before pushing.

John
     virNWFilterUnRegisterCallbackDriver(&qemuCallbackDriver);
     virObjectUnref(qemu_driver->config);
     virObjectUnref(qemu_driver->hostdevMgr);
@@ -1099,7 +1100,6 @@ qemuStateCleanup(void)
     virLockManagerPluginUnref(qemu_driver->lockManager);
 
     virMutexDestroy(&qemu_driver->lock);
-    virThreadPoolFree(qemu_driver->workerPool);
     VIR_FREE(qemu_driver);
 
     return 0;

On 27.10.2016 16:00, John Ferlan wrote:
On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This fixes the crash [1].
It happens due to a race between libvirtd exiting and a qemu process exiting at the same time. Let's just wait for all workers that keep a driver pointer to finish before disposing of the driver object or any of its members.
[1] crash backtrace: (bt is shortened a bit):
Before pushing I'll reword the above just a little bit for readability
 0  0x00007ffff7282f2b in virClassIsDerivedFrom (klass=0xdeadbeef, parent=0x55555581d650) at util/virobject.c:169
 1  0x00007ffff72835fd in virObjectIsClass (anyobj=0x7fffd024f580, klass=0x55555581d650) at util/virobject.c:365
 2  0x00007ffff7283498 in virObjectLock (anyobj=0x7fffd024f580) at util/virobject.c:317
 3  0x00007ffff722f0a3 in virCloseCallbacksUnset (closeCallbacks=0x7fffd024f580, vm=0x7fffd0194db0, cb=0x7fffdf1af765 <qemuProcessAutoDestroy>) at util/virclosecallbacks.c:164
 4  0x00007fffdf1afa7b in qemuProcessAutoDestroyRemove (driver=0x7fffd00f3a60, vm=0x7fffd0194db0) at qemu/qemu_process.c:6365
 5  0x00007fffdf1adff1 in qemuProcessStop (driver=0x7fffd00f3a60, vm=0x7fffd0194db0, reason=VIR_DOMAIN_SHUTOFF_CRASHED, asyncJob=QEMU_ASYNC_JOB_NONE, flags=0) at qemu/qemu_process.c:5877
 6  0x00007fffdf1f711c in processMonitorEOFEvent (driver=0x7fffd00f3a60, vm=0x7fffd0194db0) at qemu/qemu_driver.c:4545
 7  0x00007fffdf1f7313 in qemuProcessEventHandler (data=0x555555832710, opaque=0x7fffd00f3a60) at qemu/qemu_driver.c:4589
 8  0x00007ffff72a84c4 in virThreadPoolWorker (opaque=0x555555805da0) at util/virthreadpool.c:167

Thread 1 (Thread 0x7ffff7fb1880 (LWP 494472)):
 1  0x00007ffff72a7898 in virCondWait (c=0x7fffd01c21f8, m=0x7fffd01c21a0) at util/virthread.c:154
 2  0x00007ffff72a8a22 in virThreadPoolFree (pool=0x7fffd01c2160) at util/virthreadpool.c:290
 3  0x00007fffdf1edd44 in qemuStateCleanup () at qemu/qemu_driver.c:1102
 4  0x00007ffff736570a in virStateCleanup () at libvirt.c:807
 5  0x000055555556f991 in main (argc=1, argv=0x7fffffffe458) at libvirtd.c:1660

---
 src/qemu/qemu_driver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 12ddbc0..58753ea 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1068,6 +1068,7 @@ qemuStateCleanup(void)
     if (!qemu_driver)
         return -1;
+ virThreadPoolFree(qemu_driver->workerPool);
For this one - your thoughts on swapping the order to be NWFilterUnRegister and then ThreadPoolFree(workerPool)?
Nothing in the NW filter callback jumps out at me as generating a workerPool job, but since the order of initialization is workerPool, then NWFilterCallbackRegister, I figure the cleanup order should just be inverted.
Looks like these two are independent... so we can use either order, and cleaning up in the reverse order of initialization is probably good style.
ACK for what's here, but I'll await yours or any other feedback on swapping before pushing.
John
     virNWFilterUnRegisterCallbackDriver(&qemuCallbackDriver);
     virObjectUnref(qemu_driver->config);
     virObjectUnref(qemu_driver->hostdevMgr);
@@ -1099,7 +1100,6 @@ qemuStateCleanup(void)
     virLockManagerPluginUnref(qemu_driver->lockManager);
 
     virMutexDestroy(&qemu_driver->lock);
-    virThreadPoolFree(qemu_driver->workerPool);
     VIR_FREE(qemu_driver);
 
     return 0;

This is why we should not free the callback object synchronously upon removing a handle or timeout. Imagine:

1. loop thread: drops the lock and is about to call an event callback
2. another thread: enters remove handle and frees the callback object
3. loop thread: enters the event callback and uses the callback object - BOOOM

---
 src/util/vireventpoll.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/util/vireventpoll.c b/src/util/vireventpoll.c
index 81ecab4..802b679 100644
--- a/src/util/vireventpoll.c
+++ b/src/util/vireventpoll.c
@@ -176,6 +176,10 @@ void virEventPollUpdateHandle(int watch, int events)
  * Unregister a callback from a file handle
  * NB, it *must* be safe to call this from within a callback
  * For this reason we only ever set a flag in the existing list.
+ * Another reason is that as we drop the lock upon event callback
+ * invocation we can't free callback object if we called
+ * from a thread then loop thread.
+ *
  * Actual deletion will be done out-of-band
  */
 int virEventPollRemoveHandle(int watch)
@@ -295,6 +299,9 @@ void virEventPollUpdateTimeout(int timer, int frequency)
  * Unregister a callback for a timer
  * NB, it *must* be safe to call this from within a callback
  * For this reason we only ever set a flag in the existing list.
+ * Another reason is that as we drop the lock upon event callback
+ * invocation we can't free callback object if we called
+ * from a thread then loop thread.
  * Actual deletion will be done out-of-band
  */
 int virEventPollRemoveTimeout(int timer)
-- 
1.8.3.1

On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This is why we should not free the callback object synchronously upon removing a handle or timeout. Imagine:
1. loop thread: drops the lock and is about to call an event callback
2. another thread: enters remove handle and frees the callback object
3. loop thread: enters the event callback and uses the callback object - BOOOM

---
 src/util/vireventpoll.c | 7 +++++++
 1 file changed, 7 insertions(+)
I'm having difficulty trying to decipher the point you're trying to make in the context of the existing comments, the previous upstream series, and the cover letter explanation.

While not explicitly stated for each, the 'flag' that's set is 'deleted'. Once the 'deleted' flag is set, it's expected that the object will be reaped later, when it's safer to do so (such as in virEventPollCleanupTimeouts and virEventPollCleanupHandles), via the 'ff' passed during virEventAddHandle and virEventAddTimeout.

I'm going to pass on pushing this one.

John
diff --git a/src/util/vireventpoll.c b/src/util/vireventpoll.c
index 81ecab4..802b679 100644
--- a/src/util/vireventpoll.c
+++ b/src/util/vireventpoll.c
@@ -176,6 +176,10 @@ void virEventPollUpdateHandle(int watch, int events)
  * Unregister a callback from a file handle
  * NB, it *must* be safe to call this from within a callback
  * For this reason we only ever set a flag in the existing list.
+ * Another reason is that as we drop the lock upon event callback
+ * invocation we can't free callback object if we called
+ * from a thread then loop thread.
+ *
  * Actual deletion will be done out-of-band
  */
 int virEventPollRemoveHandle(int watch)
@@ -295,6 +299,9 @@ void virEventPollUpdateTimeout(int timer, int frequency)
  * Unregister a callback for a timer
  * NB, it *must* be safe to call this from within a callback
  * For this reason we only ever set a flag in the existing list.
+ * Another reason is that as we drop the lock upon event callback
+ * invocation we can't free callback object if we called
+ * from a thread then loop thread.
  * Actual deletion will be done out-of-band
  */
 int virEventPollRemoveTimeout(int timer)

On 28.10.2016 00:58, John Ferlan wrote:
On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This is why we should not free the callback object synchronously upon removing a handle or timeout. Imagine:
1. loop thread: drops the lock and is about to call an event callback
2. another thread: enters remove handle and frees the callback object
3. loop thread: enters the event callback and uses the callback object - BOOOM

---
 src/util/vireventpoll.c | 7 +++++++
 1 file changed, 7 insertions(+)
I'm having difficulty trying to decipher the point you're trying to make in the context of the existing comments, the previous upstream series, and the cover letter explanation.
While not explicitly stated for each, the 'flag' that's set is 'deleted'. Once the 'deleted' flag is set, it's expected that the object will be reaped later when it's safer to do so (such as virEventPollCleanupTimeouts and virEventPollCleanupHandles) via the 'ff' passed during virEventAddHandle and virEventAddTimeout.
All this is true. I just wanted to add why we need to keep this state of affairs. The thing is that in the previous patch series I tried to free the callback object instantly rather than deferring the freeing. That was my attempt to address a specific problem, and it looked like a possible solution. Then Daniel noted that this implementation is public and I would break the API. Meanwhile I found out that I actually cannot free the callback object without deferring, and I want to comment on that.

Nikolay
I'm going to pass on pushing this one.
John
diff --git a/src/util/vireventpoll.c b/src/util/vireventpoll.c
index 81ecab4..802b679 100644
--- a/src/util/vireventpoll.c
+++ b/src/util/vireventpoll.c
@@ -176,6 +176,10 @@ void virEventPollUpdateHandle(int watch, int events)
  * Unregister a callback from a file handle
  * NB, it *must* be safe to call this from within a callback
  * For this reason we only ever set a flag in the existing list.
+ * Another reason is that as we drop the lock upon event callback
+ * invocation we can't free callback object if we called
+ * from a thread then loop thread.
+ *
  * Actual deletion will be done out-of-band
  */
 int virEventPollRemoveHandle(int watch)
@@ -295,6 +299,9 @@ void virEventPollUpdateTimeout(int timer, int frequency)
  * Unregister a callback for a timer
  * NB, it *must* be safe to call this from within a callback
  * For this reason we only ever set a flag in the existing list.
+ * Another reason is that as we drop the lock upon event callback
+ * invocation we can't free callback object if we called
+ * from a thread then loop thread.
I've missed a word; it should read: "from a thread other than the loop thread".
  * Actual deletion will be done out-of-band
  */
 int virEventPollRemoveTimeout(int timer)

So the above patches are pushed now. What about the last one? I've elaborated on its meaning in a sibling letter.

Nikolay

On 28.10.2016 00:58, John Ferlan wrote:
On 10/04/2016 10:27 AM, Nikolay Shirokovskiy wrote:
This is why we should not free the callback object synchronously upon removing a handle or timeout. Imagine:
1. loop thread, drops the lock and is about to call event callback
2. another thread, enters remove handle and frees callback object
3. loop thread, enters event callback, uses callback object

BOOOM

---
 src/util/vireventpoll.c | 7 +++++++
 1 file changed, 7 insertions(+)

On 11/14/2016 09:27 AM, Nikolay Shirokovskiy wrote:
So the above patches are pushed now. What about the last one? I've elaborated on its meaning in sibling letter.
Nikolay
You can always retry with a new patch that adds what you think will be helpful. Keep in mind what I wrote and what you wrote, and propose the adjustment. I agree with your feeling that the code isn't as self-documenting as perhaps originally felt, but in order to alter the descriptions (comments), I think the new comments should be more helpful. As I pointed out in my response, I was having difficulty figuring out what was being noted by just reading the new comments. After reading the cover letter, other code, and Dan's response to your original series, I started to piece things together, which is what I posted. I think at one time I was going to note perhaps adjusting the docs as well, e.g. http://libvirt.org/internals/eventloop.html

John
Participants (2): John Ferlan, Nikolay Shirokovskiy