On 27.04.2020 18:54, Daniel P. Berrangé wrote:
We got a new BZ filed about a libvirtd crash in shutdown
https://bugzilla.redhat.com/show_bug.cgi?id=1828207
We can see from the stack trace that the "interface" driver is in
the middle of servicing an RPC call for virConnectListAllInterfaces()
Meanwhile the libvirtd daemon is doing virObjectUnref(dmn) on the
virNetDaemonPtr object.
The fact that it is doing this unref, means that it must have already
call virStateCleanup(), given the code sequence:
/* Run event loop. */
virNetDaemonRun(dmn);
ret = 0;
virHookCall(VIR_HOOK_DRIVER_DAEMON, "-", VIR_HOOK_DAEMON_OP_SHUTDOWN,
0, "shutdown", NULL, NULL);
cleanup:
/* Keep cleanup order in inverse order of startup */
virNetDaemonClose(dmn);
virNetlinkEventServiceStopAll();
if (driversInitialized) {
/* NB: Possible issue with timing window between driversInitialized
* setting if virNetlinkEventServerStart fails */
driversInitialized = false;
virStateCleanup();
}
virObjectUnref(adminProgram);
virObjectUnref(srvAdm);
virObjectUnref(qemuProgram);
virObjectUnref(lxcProgram);
virObjectUnref(remoteProgram);
virObjectUnref(srv);
virObjectUnref(dmn);
Unless I'm missing something non-obvious, this cleanup code path is
inherantly broken & racy. When virNetDaemonRun() returns the RPC
worker threads are all still active. They are all liable to still
be executing RPC calls, which means any of the drivers may be in
use. So calling virStateCleanup() is an inherantly dangerous
thing to do. There is the further complication that once we have
exitted the main loop we may prevent the RPC calls from ever
completing, as they may be waiting on an event to be dispatched.
I know we're had various patch proposals in the past to improve the
robustness of shutdown cleanup but I can't remember the outcome of the
reviews. Hopefully people involved in those threads can jump in here...
Hi, all!
Yeah I and John Ferlan attempted to address the issue in the past.
The last reference I found is [1]. One can dig down the history
of the issue thru the links inside.
IIRC the latest approach was similar to what you suggested below namely
run event loop for a while after quit is seen in order to let
RPC worker threads waiting for event to finish.
[1]
https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html
IMHO the key problem here is the virNetDeamonRun() method which just
looks at the "quit" flag and immediately returns if it is set.
This needs to be changed so that when it sees quit == true, it takes
the following actions
1. Call virNetDaemonClose() to drop all RPC clients and thus prevent
new RPC calls arriving
2. Flush any RPC calls which are queued but not yet assigned to a
worker thread
3. Tell worker threads to exit after finishing their current job
4. Wait for all worker threads to exit
5. Now virNetDaemonRun may return
At this point we can call virStateCleanup and the various other
things, as we know no drivers are still active in RPC calls.
Having said that, there could be background threads in the the
drivers which are doing work that uses the event loop thread.
So we probably need a virStateClose() method that we call from
virNetDaemonRun, *after* all worker threads are gone, which would
cleanup any background threads while the event loop is still
running.
I guess RPC worker threads may need some signal to finish in time too.
For example in case of migration.
Nikolay
The issue is that step 4 above ("Wait for all worker threads to exit")
may take too long, or indeed never complete. To deal with this, it
will need a timeout. In the remote_daemon.c cleanup code path, if
there are still worker threads present, then we need to skip all
cleanup and simply call _exit(0) to terminate the process with no
attempt at cleanup, since it would be unsafe to try anything else.
Regards,
Daniel