On 07/09/2018 03:11 AM, Peter Krempa wrote:
On Sat, Jul 07, 2018 at 08:11:05 -0400, John Ferlan wrote:
> When virNetDaemonQuit is called from libvirtd's shutdown
> handler (daemonShutdownHandler) we need to perform the quit
> in multiple steps. The first part is to "request" the quit
> and notify the NetServer's of the impending quit which causes
> the NetServers to inform their workers that a quit was requested.
>
> Still because we cannot guarantee a quit will happen or it's
> possible there's no workers pending, use a virNetDaemonQuitTimer
> to not only break the event loop but keep track of how long we're
> waiting and we've waited too long, force an ungraceful exit so
> that we don't hang waiting forever or cause some sort of SEGV
> because something is still pending and we Unref things.
>
> Signed-off-by: John Ferlan <jferlan(a)redhat.com>
> ---
> src/libvirt_remote.syms | 1 +
> src/remote/remote_daemon.c | 1 +
> src/rpc/virnetdaemon.c | 68 +++++++++++++++++++++++++++++++++++++-
> src/rpc/virnetdaemon.h | 4 +++
> 4 files changed, 73 insertions(+), 1 deletion(-)
[...]
> @@ -855,10 +904,27 @@ virNetDaemonRun(virNetDaemonPtr dmn)
> virObjectLock(dmn);
>
> virHashForEach(dmn->servers, daemonServerProcessClients, NULL);
> +
> + /* HACK: Add a dummy timeout to break event loop */
> + if (dmn->quitRequested && quitTimer == -1)
> + quitTimer = virEventAddTimeout(500, virNetDaemonQuitTimer,
> + &quitCount, NULL);
> +
> + if (dmn->quitRequested && daemonServerWorkersDone(dmn)) {
> + dmn->quit = true;
> + } else {
> + /* Firing every 1/2 second and quitTimeout in seconds, force
> + * an exit when there are still worker threads running and we
> + * have waited long enough */
> + if (quitCount > dmn->quitTimeout * 2)
> + _exit(EXIT_FAILURE);
If you have a legitimate long-running job which would finish eventually
and e.g. write an updated status XML this will break things. I'm not
persuaded that this is a systematic solution to some API getting stuck.
The commit message also does not help persuading me that this is a good
idea.
NACK
I understand the sentiment and this hasn't been a "top of the list" item
for me to think about, but to a degree the purpose of re-posting the
series was to try to come to a consensus over some solution. A naked
NACK almost makes it appear as if you would prefer to not do anything in
this situation, letting nature takes it's course (so to speak).
Do you have any thoughts or suggestions related to what processing you
believe is appropriate if someone issues a SIG{INT|QUIT|TERM} to a
daemon (libvirtd|lockd|logd}? That is what to do if we reach the
cleanup: label in main() of src/remote/remote_daemon.c and
virNetDaemonClose() gets run? Right now IIRC it's either going to be a
SEGV or possible long wait for some worker thread to complete. The
latter is the don't apply this example to cause a wait in block stats
fetch. Naturally the long term wait only matters up through the point
while someone is patient before issuing a SIGKILL.
Do you favor waiting forever for some worker thread to finish? To some
degree if some signal is sent to libvirtd, then I would think the
expectation is to not wait for some indeterminate time. And most
definitely trying to avoid some sort of defunct state resolvable by a
host reboot. Knowing exactly what a worker thread is doing may help, but
that'd take a bit (;-)) of code to trace the stack. Perhaps providing a
list of client/workers still active or even some indication that we're
waiting on some worker rather than no feedback? How much is "too much"?
Another option that was proposed, but didn't really gain any support was
introduction of a stateShutdown in virStateDriver that would be run
during libvirtd's cleanup: processing prior to virNetDaemonClose. That
would be before virStateCleanup. For qemu, the priv->mon and priv->agent
would be closed. That would seemingly generate an error and cause the
worker to exit thus theoretically not needing any forced quit of the
thread "if" the reason was some sort of hang as a result of
monitor/agent processing. Doesn't mean there wouldn't be some other
issue causing a hang. Refreshing my memory on this - there was some
discussion or investigation related to using virStateStop, but I since
that's gated by "WITH_DBUS" being defined, it wasn't clear if/how it
could be used (daemonStop is only called/setup if daemonRunStateInit has
set it up).
John