Races / crashes in shutdown of libvirtd daemon

We got a new BZ filed about a libvirtd crash in shutdown https://bugzilla.redhat.com/show_bug.cgi?id=1828207 We can see from the stack trace that the "interface" driver is in the middle of servicing an RPC call for virConnectListAllInterfaces() Meanwhile the libvirtd daemon is doing virObjectUnref(dmn) on the virNetDaemonPtr object. The fact that it is doing this unref, means that it must have already call virStateCleanup(), given the code sequence: /* Run event loop. */ virNetDaemonRun(dmn); ret = 0; virHookCall(VIR_HOOK_DRIVER_DAEMON, "-", VIR_HOOK_DAEMON_OP_SHUTDOWN, 0, "shutdown", NULL, NULL); cleanup: /* Keep cleanup order in inverse order of startup */ virNetDaemonClose(dmn); virNetlinkEventServiceStopAll(); if (driversInitialized) { /* NB: Possible issue with timing window between driversInitialized * setting if virNetlinkEventServerStart fails */ driversInitialized = false; virStateCleanup(); } virObjectUnref(adminProgram); virObjectUnref(srvAdm); virObjectUnref(qemuProgram); virObjectUnref(lxcProgram); virObjectUnref(remoteProgram); virObjectUnref(srv); virObjectUnref(dmn); Unless I'm missing something non-obvious, this cleanup code path is inherantly broken & racy. When virNetDaemonRun() returns the RPC worker threads are all still active. They are all liable to still be executing RPC calls, which means any of the drivers may be in use. So calling virStateCleanup() is an inherantly dangerous thing to do. There is the further complication that once we have exitted the main loop we may prevent the RPC calls from ever completing, as they may be waiting on an event to be dispatched. I know we're had various patch proposals in the past to improve the robustness of shutdown cleanup but I can't remember the outcome of the reviews. Hopefully people involved in those threads can jump in here... IMHO the key problem here is the virNetDeamonRun() method which just looks at the "quit" flag and immediately returns if it is set. This needs to be changed so that when it sees quit == true, it takes the following actions 1. Call virNetDaemonClose() to drop all RPC clients and thus prevent new RPC calls arriving 2. Flush any RPC calls which are queued but not yet assigned to a worker thread 3. Tell worker threads to exit after finishing their current job 4. Wait for all worker threads to exit 5. Now virNetDaemonRun may return At this point we can call virStateCleanup and the various other things, as we know no drivers are still active in RPC calls. Having said that, there could be background threads in the the drivers which are doing work that uses the event loop thread. So we probably need a virStateClose() method that we call from virNetDaemonRun, *after* all worker threads are gone, which would cleanup any background threads while the event loop is still running. The issue is that step 4 above ("Wait for all worker threads to exit") may take too long, or indeed never complete. To deal with this, it will need a timeout. In the remote_daemon.c cleanup code path, if there are still worker threads present, then we need to skip all cleanup and simply call _exit(0) to terminate the process with no attempt at cleanup, since it would be unsafe to try anything else. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 27.04.2020 18:54, Daniel P. Berrangé wrote:
We got a new BZ filed about a libvirtd crash in shutdown
https://bugzilla.redhat.com/show_bug.cgi?id=1828207
We can see from the stack trace that the "interface" driver is in the middle of servicing an RPC call for virConnectListAllInterfaces()
Meanwhile the libvirtd daemon is doing virObjectUnref(dmn) on the virNetDaemonPtr object.
The fact that it is doing this unref, means that it must have already call virStateCleanup(), given the code sequence:
/* Run event loop. */ virNetDaemonRun(dmn);
ret = 0;
virHookCall(VIR_HOOK_DRIVER_DAEMON, "-", VIR_HOOK_DAEMON_OP_SHUTDOWN, 0, "shutdown", NULL, NULL);
cleanup: /* Keep cleanup order in inverse order of startup */ virNetDaemonClose(dmn);
virNetlinkEventServiceStopAll();
if (driversInitialized) { /* NB: Possible issue with timing window between driversInitialized * setting if virNetlinkEventServerStart fails */ driversInitialized = false; virStateCleanup(); }
virObjectUnref(adminProgram); virObjectUnref(srvAdm); virObjectUnref(qemuProgram); virObjectUnref(lxcProgram); virObjectUnref(remoteProgram); virObjectUnref(srv); virObjectUnref(dmn);
Unless I'm missing something non-obvious, this cleanup code path is inherantly broken & racy. When virNetDaemonRun() returns the RPC worker threads are all still active. They are all liable to still be executing RPC calls, which means any of the drivers may be in use. So calling virStateCleanup() is an inherantly dangerous thing to do. There is the further complication that once we have exitted the main loop we may prevent the RPC calls from ever completing, as they may be waiting on an event to be dispatched.
I know we're had various patch proposals in the past to improve the robustness of shutdown cleanup but I can't remember the outcome of the reviews. Hopefully people involved in those threads can jump in here...
Hi, all! Yeah I and John Ferlan attempted to address the issue in the past. The last reference I found is [1]. One can dig down the history of the issue thru the links inside. IIRC the latest approach was similar to what you suggested below namely run event loop for a while after quit is seen in order to let RPC worker threads waiting for event to finish. [1] https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html
IMHO the key problem here is the virNetDeamonRun() method which just looks at the "quit" flag and immediately returns if it is set. This needs to be changed so that when it sees quit == true, it takes the following actions
1. Call virNetDaemonClose() to drop all RPC clients and thus prevent new RPC calls arriving 2. Flush any RPC calls which are queued but not yet assigned to a worker thread 3. Tell worker threads to exit after finishing their current job 4. Wait for all worker threads to exit 5. Now virNetDaemonRun may return
At this point we can call virStateCleanup and the various other things, as we know no drivers are still active in RPC calls.
Having said that, there could be background threads in the the drivers which are doing work that uses the event loop thread.
So we probably need a virStateClose() method that we call from virNetDaemonRun, *after* all worker threads are gone, which would cleanup any background threads while the event loop is still running.
I guess RPC worker threads may need some signal to finish in time too. For example in case of migration. Nikolay
The issue is that step 4 above ("Wait for all worker threads to exit") may take too long, or indeed never complete. To deal with this, it will need a timeout. In the remote_daemon.c cleanup code path, if there are still worker threads present, then we need to skip all cleanup and simply call _exit(0) to terminate the process with no attempt at cleanup, since it would be unsafe to try anything else.
Regards, Daniel

On 4/27/20 5:54 PM, Daniel P. Berrangé wrote:
We got a new BZ filed about a libvirtd crash in shutdown
And there is another one: https://bugzilla.redhat.com/show_bug.cgi?id=1895359 The problem is that when host is shutting down we get PrepareForShutdown() signal on dbus and spawn a thread that will eventually call qemuStateStop(). But before it gets a chance to run the main thread quits the event loop and calls qemuStateCleanup() freeing the qemu driver. After all this the dbus signal handling thread gets to run only to find qemu_driver=0x0 and thus crash. From the fact that the event loop quit I deduct that we were sent a SIGTERM which means there was no guest running otherwise we would inhibit the shutdown, so qemuStateStop() would be a NOP anyway. So maybe the fix for this particular case indeed is to check whether qemu_driver == 0x0. Anyway, I think your idea is sound. Michal

On 12.11.2020 20:12, Michal Privoznik wrote:
On 4/27/20 5:54 PM, Daniel P. Berrangé wrote:
We got a new BZ filed about a libvirtd crash in shutdown
And there is another one:
https://bugzilla.redhat.com/show_bug.cgi?id=1895359
The problem is that when host is shutting down we get PrepareForShutdown() signal on dbus and spawn a thread that will eventually call qemuStateStop(). But before it gets a chance to run the main thread quits the event loop and calls qemuStateCleanup() freeing the qemu driver. After all this the dbus signal handling thread gets to run only to find qemu_driver=0x0 and thus crash.
From the fact that the event loop quit I deduct that we were sent a SIGTERM which means there was no guest running otherwise we would inhibit the shutdown, so qemuStateStop() would be a NOP anyway. So maybe the fix for this particular case indeed is to check whether qemu_driver == 0x0.
Anyway, I think your idea is sound.
Hi, Michal. I added code to handle shutdown as Dan proposed in [1]. It handles most busy threads which are rpc and worker and per-VM event loop threads in qemu driver. But there are still other drivers and short living threads here and there [2]. So thread that spawns daemonStop is on of these untamed threads. So we need to join this thread in daemonShutdownWait. [1] [PATCH v2 00/13] resolve hangs/crashes on libvirtd shutdown https://www.redhat.com/archives/libvir-list/2020-July/msg01606.html [2] Re: [PATCH v2 00/13] resolve hangs/crashes on libvirtd shutdown https://www.redhat.com/archives/libvir-list/2020-September/msg00319.html Nikolay

On 11/13/20 7:34 AM, Nikolay Shirokovskiy wrote:
On 12.11.2020 20:12, Michal Privoznik wrote:
On 4/27/20 5:54 PM, Daniel P. Berrangé wrote:
We got a new BZ filed about a libvirtd crash in shutdown
And there is another one:
https://bugzilla.redhat.com/show_bug.cgi?id=1895359
The problem is that when host is shutting down we get PrepareForShutdown() signal on dbus and spawn a thread that will eventually call qemuStateStop(). But before it gets a chance to run the main thread quits the event loop and calls qemuStateCleanup() freeing the qemu driver. After all this the dbus signal handling thread gets to run only to find qemu_driver=0x0 and thus crash.
From the fact that the event loop quit I deduct that we were sent a SIGTERM which means there was no guest running otherwise we would inhibit the shutdown, so qemuStateStop() would be a NOP anyway. So maybe the fix for this particular case indeed is to check whether qemu_driver == 0x0.
Anyway, I think your idea is sound.
Hi, Michal.
I added code to handle shutdown as Dan proposed in [1]. It handles most busy threads which are rpc and worker and per-VM event loop threads in qemu driver. But there are still other drivers and short living threads here and there [2]. So thread that spawns daemonStop is on of these untamed threads.
So we need to join this thread in daemonShutdownWait.
Ah, so are you saying that my patch is not correct? https://www.redhat.com/archives/libvir-list/2020-November/thread.html and I see you replied, so let's continue discussion there. Michal
participants (3)
-
Daniel P. Berrangé
-
Michal Privoznik
-
Nikolay Shirokovskiy