[libvirt] [PATCH 0/9] Resolve libvirtd hang on termination with connected long running client

RFC: https://www.redhat.com/archives/libvir-list/2018-January/msg00318.html

Adjustments since RFC...

Patches 1&2: No change, were already R-B'd
Patch 3: Removed code as noted in code review, update commit message
Patch 4: From old series removed, see below for more details
Patch 9: no change

NB: Patches 5-8 and 10 from Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
are removed as they seemed not to be necessary.

Replaced the former patch 4 with a series of patches to (slowly) provide
support to disable new connections, remove waiting jobs, cause the waiting
workers to quit, and allow any running jobs to complete.

As it turns out, waiting for running jobs to complete cannot be done from
the virNetServerClose callbacks, because that means the event loop
processing done during virNetDaemonRun will not allow any currently long
running worker job thread a means to complete.

So when virNetDaemonQuit is called as a result of the libvirtd signal
handlers for SIG{QUIT|INT|TERM}, instead of just causing virNetDaemonRun
to quit immediately, alter it to use a quitRequested flag, and then use
that flag to check for long running worker threads before causing the
event loop to quit, so that libvirtd is able to run through the
virNetDaemonClose processing.

John Ferlan (9):
  libvirtd: Alter refcnt processing for domain server objects
  libvirtd: Alter refcnt processing for server program objects
  netserver: Remove ServiceToggle during ServerDispose
  util: Introduce virThreadPoolDrain
  rpc: Introduce virNetServerQuitRequested
  rpc: Introduce virNetServerWorkerCount
  rpc: Alter virNetDaemonQuit processing
  docs: Add news article for libvirtd issue
  APPLY ONLY FOR TESTING PURPOSES

 daemon/libvirtd.c        | 43 +++++++++++++++++++++++---------
 docs/news.xml            | 12 +++++++++
 src/libvirt_private.syms |  1 +
 src/libvirt_remote.syms  |  2 ++
 src/qemu/qemu_driver.c   |  5 ++++
 src/rpc/virnetdaemon.c   | 45 +++++++++++++++++++++++++++++++++-
 src/rpc/virnetserver.c   | 52 ++++++++++++++++++++++++++++++++++++---
 src/rpc/virnetserver.h   |  4 +++
 src/util/virthreadpool.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 src/util/virthreadpool.h |  2 ++
 10 files changed, 204 insertions(+), 26 deletions(-)

--
2.13.6
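For illustration, a minimal self-contained sketch of the two-phase quit described in the cover letter above: the signal handler side only sets quitRequested, and the loop keeps iterating until the long running worker has finished. This is an assumed simplification in plain pthreads/stdatomic, not the series' code (compile with -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool quitRequested;   /* set where the SIG* handler would */
static atomic_int activeWorkers;

static void *worker(void *arg)
{
    (void)arg;
    sleep(2);                       /* stand-in for a long running RPC job */
    atomic_fetch_sub(&activeWorkers, 1);
    return NULL;
}

int main(void)
{
    pthread_t th;
    bool quit = false;

    atomic_store(&activeWorkers, 1);
    pthread_create(&th, NULL, worker, NULL);

    atomic_store(&quitRequested, true);      /* as if SIGTERM arrived */

    while (!quit) {                          /* event loop keeps spinning... */
        usleep(100 * 1000);                  /* ...so the worker can finish */
        if (atomic_load(&quitRequested) && atomic_load(&activeWorkers) == 0)
            quit = true;                     /* only now tear everything down */
    }
    pthread_join(th, NULL);
    printf("clean shutdown after worker completion\n");
    return 0;
}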

Once virNetDaemonAddServer returns, the @srv or @srvAdm have either been
added to the @dmn list increasing the refcnt or there was an error. In
either case, we can lower the refcnt from virNetServerNew, but not set the
pointer to NULL; thus we "use" the hash table reference when adding
programs later, or allow the immediate free of the object on error.

The @dmn dispose function (virNetDaemonDispose) will handle the Unref of
each object during the virHashFree when the object is removed from the
hash table.

Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 6d3b83355..0afd1dd82 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1065,6 +1065,7 @@ int main(int argc, char **argv) {
     char *remote_config_file = NULL;
     int statuswrite = -1;
     int ret = 1;
+    int rc;
     int pid_file_fd = -1;
     char *pid_file = NULL;
     char *sock_file = NULL;
@@ -1319,7 +1320,11 @@ int main(int argc, char **argv) {
         goto cleanup;
     }

-    if (virNetDaemonAddServer(dmn, srv) < 0) {
+    /* Add @srv to @dmn servers hash table and Unref to allow removal from
+     * hash table at @dmn disposal to decide when to free @srv */
+    rc = virNetDaemonAddServer(dmn, srv);
+    virObjectUnref(srv);
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1393,7 +1398,11 @@ int main(int argc, char **argv) {
         goto cleanup;
     }

-    if (virNetDaemonAddServer(dmn, srvAdm) < 0) {
+    /* Add @srvAdm to @dmn servers hash table and Unref to allow removal from
+     * hash table at @dmn disposal to decide when to free @srvAdm */
+    rc = virNetDaemonAddServer(dmn, srvAdm);
+    virObjectUnref(srvAdm);
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1516,11 +1525,9 @@ int main(int argc, char **argv) {
     }

     virObjectUnref(adminProgram);
-    virObjectUnref(srvAdm);
     virObjectUnref(qemuProgram);
     virObjectUnref(lxcProgram);
     virObjectUnref(remoteProgram);
-    virObjectUnref(srv);
     virObjectUnref(dmn);

     virNetlinkShutdown();
--
2.13.6

On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
Once virNetDaemonAddServer returns, the @srv or @srvAdm have either been added to the @dmn list increasing the refcnt or there was an error. In either case, we can lower the refcnt from virNetServerNew, but not set the pointer to NULL; thus we "use" the hash table reference when adding programs later, or allow the immediate free of the object on error.
The @dmn dispose function (virNetDaemonDispose) will handle the Unref of each object during the virHashFree when the object is removed from the hash table.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 6d3b83355..0afd1dd82 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1065,6 +1065,7 @@ int main(int argc, char **argv) {
     char *remote_config_file = NULL;
     int statuswrite = -1;
     int ret = 1;
+    int rc;
     int pid_file_fd = -1;
     char *pid_file = NULL;
     char *sock_file = NULL;
@@ -1319,7 +1320,11 @@ int main(int argc, char **argv) {
         goto cleanup;
     }

-    if (virNetDaemonAddServer(dmn, srv) < 0) {
+    /* Add @srv to @dmn servers hash table and Unref to allow removal from
+     * hash table at @dmn disposal to decide when to free @srv */
+    rc = virNetDaemonAddServer(dmn, srv);
Reviewed-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>

Wouldn’t it be cleaner to initialize the server and then add it to the daemon? (Disadvantage: it makes the correct unreferencing a little more complicated.)

<pseudo_code>
create_and_initialize_daemon
create_and_add_programs_server
virNetDaemonAddServer()
…
</pseudo_code>
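For illustration, a self-contained toy model of the ownership transfer being discussed: build the child object completely, then let the container take the only long-lived reference, so cleanup has a single owner. The names Obj, Daemon, and daemonAddServer are hypothetical stand-ins for the virObject refcounting and virNetDaemonAddServer; this is a sketch, not libvirt code:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int refs; } Obj;

static Obj *objNew(void) { Obj *o = calloc(1, sizeof(*o)); o->refs = 1; return o; }
static void objRef(Obj *o) { o->refs++; }
static void objUnref(Obj *o) { if (o && --o->refs == 0) free(o); }

typedef struct { Obj *server; } Daemon;

/* Like virNetDaemonAddServer: the container takes its own reference. */
static int daemonAddServer(Daemon *d, Obj *srv) { objRef(srv); d->server = srv; return 0; }

int main(void)
{
    Daemon d = { NULL };
    Obj *srv = objNew();            /* construction reference */

    /* ...fully initialize srv (add programs/services) here... */

    if (daemonAddServer(&d, srv) < 0)
        goto cleanup;
    objUnref(srv);                  /* drop construction ref; d now owns srv */
    srv = NULL;

    printf("daemon owns server, refs=%d\n", d.server->refs);

 cleanup:
    objUnref(srv);                  /* no-op once ownership has moved */
    objUnref(d.server);             /* daemon disposal frees the server */
    return 0;
}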
+    virObjectUnref(srv);
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1393,7 +1398,11 @@ int main(int argc, char **argv) {
         goto cleanup;
     }

-    if (virNetDaemonAddServer(dmn, srvAdm) < 0) {
+    /* Add @srvAdm to @dmn servers hash table and Unref to allow removal from
+     * hash table at @dmn disposal to decide when to free @srvAdm */
+    rc = virNetDaemonAddServer(dmn, srvAdm);
+    virObjectUnref(srvAdm);
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1516,11 +1525,9 @@ int main(int argc, char **argv) {
     }

     virObjectUnref(adminProgram);
-    virObjectUnref(srvAdm);
     virObjectUnref(qemuProgram);
     virObjectUnref(lxcProgram);
     virObjectUnref(remoteProgram);
-    virObjectUnref(srv);
     virObjectUnref(dmn);

     virNetlinkShutdown();
--
2.13.6
--
Beste Grüße / Kind regards
Marc Hartmayer

Once virNetServerProgramNew returns, the program objects have either been
added to the @srv or @srvAdm list increasing the refcnt or there was an
error. In either case, we can lower the refcnt from virNetServerProgramNew
and set the object to NULL since it won't be used any more.

The @srv or @srvAdm dispose function (virNetServerDispose) will handle the
Unref of each object during the virHashFree when the object is removed from
the hash table.

Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 0afd1dd82..b47f875d9 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1352,7 +1352,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, remoteProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, remoteProgram);
+    virObjectUnref(remoteProgram);
+    remoteProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1364,7 +1368,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, lxcProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, lxcProgram);
+    virObjectUnref(lxcProgram);
+    lxcProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1376,7 +1384,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, qemuProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, qemuProgram);
+    virObjectUnref(qemuProgram);
+    qemuProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1414,7 +1426,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srvAdm, adminProgram) < 0) {
+
+    rc = virNetServerAddProgram(srvAdm, adminProgram);
+    virObjectUnref(adminProgram);
+    adminProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1524,10 +1540,6 @@ int main(int argc, char **argv) {
         virStateCleanup();
     }

-    virObjectUnref(adminProgram);
-    virObjectUnref(qemuProgram);
-    virObjectUnref(lxcProgram);
-    virObjectUnref(remoteProgram);
     virObjectUnref(dmn);

     virNetlinkShutdown();
--
2.13.6

On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
Once virNetServerProgramNew returns, the program objects have either been added to the @srv or @srvAdm list increasing the refcnt or there was an error. In either case, we can lower the refcnt from virNetServerProgramNew and set the object to NULL since it won't be used any more.
The @srv or @srvAdm dispose function (virNetServerDispose) will handle the Unref of each object during the virHashFree when the object is removed from the hash table.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
Reviewed-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>

--
Beste Grüße / Kind regards
Marc Hartmayer

On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
Once virNetServerProgramNew returns, the program objects have either been added to the @srv or @srvAdm list increasing the refcnt or there was an error. In either case, we can lower the refcnt from virNetServerProgramNew and set the object to NULL since it won't be used any more.
The @srv or @srvAdm dispose function (virNetServerDispose) will handle the Unref of each object during the virHashFree when the object is removed from the hash table.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 0afd1dd82..b47f875d9 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1352,7 +1352,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, remoteProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, remoteProgram);
+    virObjectUnref(remoteProgram);
+    remoteProgram = NULL;
I’ve just looked at the code and all those variables are global variables…

libvirtd.c:
virNetServerProgramPtr remoteProgram = NULL;
virNetServerProgramPtr adminProgram = NULL;
virNetServerProgramPtr qemuProgram = NULL;
virNetServerProgramPtr lxcProgram = NULL;

So it is not a good idea to set them to NULL, as this results in a segmentation fault:

[Current thread is 1 (Thread 0x3ff908d7910 (LWP 195236))]
(gdb) bt
#0  virNetServerProgramGetID (prog=prog@entry=0x0) at ../../src/rpc/virnetserverprogram.c:93
#1  0x000000011b726ac2 in remoteDispatchObjectEventSend (client=<optimized out>, program=0x0, procnr=procnr@entry=319, proc=0x11b767120 <xdr_remote_domain_event_callback_reboot_msg>, data=data@entry=0x3fffee7d948) at ../../daemon/remote.c:3997
#2  0x000000011b7577d2 in remoteRelayDomainEventReboot (conn=<optimized out>, dom=0x143663320, opaque=0x3ff700330c0) at ../../daemon/remote.c:365
#3  0x000003ff904173a8 in virDomainEventDispatchDefaultFunc (conn=0x3fef42ca860, event=0x1436a37b0, cb=0x11b7576d0 <remoteRelayDomainEventReboot>, cbopaque=0x3ff700330c0) at ../../src/conf/domain_event.c:1787
#4  0x000003ff90414e4e in virObjectEventStateDispatchCallbacks (state=state@entry=0x3ff2c0facf0, event=0x1436a37b0, callbacks=callbacks@entry=0x3ff2c106370) at ../../src/conf/object_event.c:715
#5  0x000003ff90414eb4 in virObjectEventStateQueueDispatch (state=state@entry=0x3ff2c0facf0, queue=queue@entry=0x3fffee7dc10, callbacks=0x3ff2c106370) at ../../src/conf/object_event.c:729
#6  0x000003ff904156d4 in virObjectEventStateFlush (state=0x3ff2c0facf0) at ../../src/conf/object_event.c:830
#7  0x000003ff9041574e in virObjectEventTimer (timer=<optimized out>, opaque=<optimized out>) at ../../src/conf/object_event.c:560
#8  0x000003ff90343cc4 in virEventPollDispatchTimeouts () at ../../src/util/vireventpoll.c:457
#9  0x000003ff9034577c in virEventPollRunOnce () at ../../src/util/vireventpoll.c:653
#10 0x000003ff90343852 in virEventRunDefaultImpl () at ../../src/util/virevent.c:327
#11 0x000003ff904e5e78 in virNetDaemonRun (dmn=0x1436303b0) at ../../src/rpc/virnetdaemon.c:880
#12 0x000000011b7256ca in main (argc=<optimized out>, argv=<optimized out>) at ../../daemon/libvirtd.c:1523

Therefore, I’ll withdraw my Reviewed-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>.
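For illustration, a tiny self-contained program showing the hazard described above; Program and programGetID are hypothetical stand-ins for virNetServerProgram and virNetServerProgramGetID, and running it crashes exactly like frame #0 of the backtrace:

#include <stdio.h>

typedef struct { int id; } Program;

static Program *remoteProg;         /* a global, like remoteProgram above */

static int programGetID(Program *prog)
{
    return prog->id;                /* no NULL check, as in frame #0 */
}

int main(void)
{
    Program p = { 319 };

    remoteProg = &p;                /* "AddProgram" holds its own reference */
    remoteProg = NULL;              /* the offending "= NULL" */

    /* A later event-dispatch path still reads the global: NULL deref. */
    printf("%d\n", programGetID(remoteProg));
    return 0;
}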
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1364,7 +1368,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, lxcProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, lxcProgram);
+    virObjectUnref(lxcProgram);
+    lxcProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1376,7 +1384,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, qemuProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, qemuProgram);
+    virObjectUnref(qemuProgram);
+    qemuProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1414,7 +1426,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srvAdm, adminProgram) < 0) {
+
+    rc = virNetServerAddProgram(srvAdm, adminProgram);
+    virObjectUnref(adminProgram);
+    adminProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1524,10 +1540,6 @@ int main(int argc, char **argv) {
         virStateCleanup();
     }

-    virObjectUnref(adminProgram);
-    virObjectUnref(qemuProgram);
-    virObjectUnref(lxcProgram);
-    virObjectUnref(remoteProgram);
     virObjectUnref(dmn);

     virNetlinkShutdown();
--
2.13.6
--
Beste Grüße / Kind regards
Marc Hartmayer

On 01/24/2018 05:12 AM, Marc Hartmayer wrote:
On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
Once virNetServerProgramNew returns, the program objects have either been added to the @srv or @srvAdm list increasing the refcnt or there was an error. In either case, we can lower the refcnt from virNetServerProgramNew and set the object to NULL since it won't be used any more.
The @srv or @srvAdm dispose function (virNetServerDispose) will handle the Unref of each object during the virHashFree when the object is removed from the hash table.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 0afd1dd82..b47f875d9 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1352,7 +1352,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, remoteProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, remoteProgram);
+    virObjectUnref(remoteProgram);
+    remoteProgram = NULL;
I’ve just looked at the code and all those variables are global variables…
libvirtd.c:
virNetServerProgramPtr remoteProgram = NULL;
virNetServerProgramPtr adminProgram = NULL;
virNetServerProgramPtr qemuProgram = NULL;
virNetServerProgramPtr lxcProgram = NULL;
So it is not a good idea to set them to NULL, as this results in a segmentation fault.
Oh right... There really was no reason to set them to NULL, since the only place they're used is in this one context anyway. If the AddProgram was successful, then there would be 2 refs anyway. They're not used later in the code, so I'll remove the "= NULL" after the Unref.

John
[Current thread is 1 (Thread 0x3ff908d7910 (LWP 195236))]
(gdb) bt
#0  virNetServerProgramGetID (prog=prog@entry=0x0) at ../../src/rpc/virnetserverprogram.c:93
#1  0x000000011b726ac2 in remoteDispatchObjectEventSend (client=<optimized out>, program=0x0, procnr=procnr@entry=319, proc=0x11b767120 <xdr_remote_domain_event_callback_reboot_msg>, data=data@entry=0x3fffee7d948) at ../../daemon/remote.c:3997
#2  0x000000011b7577d2 in remoteRelayDomainEventReboot (conn=<optimized out>, dom=0x143663320, opaque=0x3ff700330c0) at ../../daemon/remote.c:365
#3  0x000003ff904173a8 in virDomainEventDispatchDefaultFunc (conn=0x3fef42ca860, event=0x1436a37b0, cb=0x11b7576d0 <remoteRelayDomainEventReboot>, cbopaque=0x3ff700330c0) at ../../src/conf/domain_event.c:1787
#4  0x000003ff90414e4e in virObjectEventStateDispatchCallbacks (state=state@entry=0x3ff2c0facf0, event=0x1436a37b0, callbacks=callbacks@entry=0x3ff2c106370) at ../../src/conf/object_event.c:715
#5  0x000003ff90414eb4 in virObjectEventStateQueueDispatch (state=state@entry=0x3ff2c0facf0, queue=queue@entry=0x3fffee7dc10, callbacks=0x3ff2c106370) at ../../src/conf/object_event.c:729
#6  0x000003ff904156d4 in virObjectEventStateFlush (state=0x3ff2c0facf0) at ../../src/conf/object_event.c:830
#7  0x000003ff9041574e in virObjectEventTimer (timer=<optimized out>, opaque=<optimized out>) at ../../src/conf/object_event.c:560
#8  0x000003ff90343cc4 in virEventPollDispatchTimeouts () at ../../src/util/vireventpoll.c:457
#9  0x000003ff9034577c in virEventPollRunOnce () at ../../src/util/vireventpoll.c:653
#10 0x000003ff90343852 in virEventRunDefaultImpl () at ../../src/util/virevent.c:327
#11 0x000003ff904e5e78 in virNetDaemonRun (dmn=0x1436303b0) at ../../src/rpc/virnetdaemon.c:880
#12 0x000000011b7256ca in main (argc=<optimized out>, argv=<optimized out>) at ../../daemon/libvirtd.c:1523
Therefore, I’ll withdraw my Reviewed-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>.
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1364,7 +1368,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, lxcProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, lxcProgram);
+    virObjectUnref(lxcProgram);
+    lxcProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1376,7 +1384,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, qemuProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, qemuProgram);
+    virObjectUnref(qemuProgram);
+    qemuProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1414,7 +1426,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srvAdm, adminProgram) < 0) {
+
+    rc = virNetServerAddProgram(srvAdm, adminProgram);
+    virObjectUnref(adminProgram);
+    adminProgram = NULL;
+    if (rc < 0) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
@@ -1524,10 +1540,6 @@ int main(int argc, char **argv) {
         virStateCleanup();
     }

-    virObjectUnref(adminProgram);
-    virObjectUnref(qemuProgram);
-    virObjectUnref(lxcProgram);
-    virObjectUnref(remoteProgram);
     virObjectUnref(dmn);

     virNetlinkShutdown();
--
2.13.6
--
Beste Grüße / Kind regards
Marc Hartmayer

On Wed, Jan 24, 2018 at 01:03 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
On 01/24/2018 05:12 AM, Marc Hartmayer wrote:
On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
Once virNetServerProgramNew returns, the program objects have either been added to the @srv or @srvAdm list increasing the refcnt or there was an error. In either case, we can lower the refcnt from virNetServerProgramNew and set the object to NULL since it won't be used any more.
The @srv or @srvAdm dispose function (virNetServerDispose) will handle the Unref of each object during the virHashFree when the object is removed from the hash table.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 0afd1dd82..b47f875d9 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1352,7 +1352,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, remoteProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, remoteProgram);
+    virObjectUnref(remoteProgram);
+    remoteProgram = NULL;
I’ve just looked at the code and all those variables are global variables…
libvirtd.c:
virNetServerProgramPtr remoteProgram = NULL;
virNetServerProgramPtr adminProgram = NULL;
virNetServerProgramPtr qemuProgram = NULL;
virNetServerProgramPtr lxcProgram = NULL;
So it is not a good idea to set them to NULL, as this results in a segmentation fault.
Oh right... There really was no reason to set them to NULL, since the only place they're used is in this one context anyway. If the AddProgram was successful, then there would be 2 refs anyway. They're not used later in the code, so I'll remove the "= NULL" after the Unref.
Is it possible to get rid of these global variables? At least for adminProgram and lxcProgram this should be easy to achieve, as they aren’t used anywhere other than in this function context.

libvirtd.h: (only these variables are declared as extern)
extern virNetServerProgramPtr remoteProgram;
extern virNetServerProgramPtr qemuProgram;
John
[…snip…]

--
Beste Grüße / Kind regards
Marc Hartmayer

On 01/24/2018 07:48 AM, Marc Hartmayer wrote:
On Wed, Jan 24, 2018 at 01:03 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
On 01/24/2018 05:12 AM, Marc Hartmayer wrote:
On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
Once virNetServerProgramNew returns, the program objects have either been added to the @srv or @srvAdm list increasing the refcnt or there was an error. In either case, we can lower the refcnt from virNetServerProgramNew and set the object to NULL since it won't be used any more.
The @srv or @srvAdm dispose function (virNetServerDispose) will handle the Unref of each object during the virHashFree when the object is removed from the hash table.
Reviewed-by: Erik Skultety <eskultet@redhat.com>
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 daemon/libvirtd.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)
diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c
index 0afd1dd82..b47f875d9 100644
--- a/daemon/libvirtd.c
+++ b/daemon/libvirtd.c
@@ -1352,7 +1352,11 @@ int main(int argc, char **argv) {
         ret = VIR_DAEMON_ERR_INIT;
         goto cleanup;
     }
-    if (virNetServerAddProgram(srv, remoteProgram) < 0) {
+
+    rc = virNetServerAddProgram(srv, remoteProgram);
+    virObjectUnref(remoteProgram);
+    remoteProgram = NULL;
I’ve just looked at the code and all those variables are global variables…
libvirtd.c:
virNetServerProgramPtr remoteProgram = NULL;
virNetServerProgramPtr adminProgram = NULL;
virNetServerProgramPtr qemuProgram = NULL;
virNetServerProgramPtr lxcProgram = NULL;
So it is not a good idea to set them to NULL, as this results in a segmentation fault.
Oh right... There really was no reason to set them to NULL, since the only place they're used is in this one context anyway. If the AddProgram was successful, then there would be 2 refs anyway. They're not used later in the code, so I'll remove the "= NULL" after the Unref.
Is it possible to get rid of these global variables? At least for adminProgram and lxcProgram this should be easy to achieve, as they aren’t used anywhere other than in this function context.
I suppose so, not sure if that was within the context of what I was doing. It'd have to be a separate "pre-patch" - in fact doable without this series. I also know Dan's been working through adding admin protocol support for logd/lockd in another series... John
libvirtd.h: (only these variables are declared as extern)
extern virNetServerProgramPtr remoteProgram;
extern virNetServerProgramPtr qemuProgram;
John
[…snip…]
--
Beste Grüße / Kind regards
Marc Hartmayer

No sense in calling ServiceToggle for all nservices during ServerDispose,
since ServerClose calls ServiceClose, which removes the IOCallback that's
being toggled via ServiceToggle.

Signed-off-by: John Ferlan <jferlan@redhat.com>
Reviewed-by: Erik Skultety <eskultet@redhat.com>
---
 src/rpc/virnetserver.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c
index 77a4c0b8d..7bab11efb 100644
--- a/src/rpc/virnetserver.c
+++ b/src/rpc/virnetserver.c
@@ -805,9 +805,6 @@ void virNetServerDispose(void *obj)

     VIR_FREE(srv->name);

-    for (i = 0; i < srv->nservices; i++)
-        virNetServerServiceToggle(srv->services[i], false);
-
     virThreadPoolFree(srv->workers);

     for (i = 0; i < srv->nservices; i++)
--
2.13.6

Create a mechanism to allow the domain/server quit code to cause any
pending jobs to be purged and to request current workers to quit.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/libvirt_private.syms |  1 +
 src/util/virthreadpool.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 src/util/virthreadpool.h |  2 ++
 3 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index bc8cc1fba..6ffceb46b 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -2865,6 +2865,7 @@ virThreadJobSetWorker;


 # util/virthreadpool.h
+virThreadPoolDrain;
 virThreadPoolFree;
 virThreadPoolGetCurrentWorkers;
 virThreadPoolGetFreeWorkers;
diff --git a/src/util/virthreadpool.c b/src/util/virthreadpool.c
index 10f2bd2c3..0baa05d12 100644
--- a/src/util/virthreadpool.c
+++ b/src/util/virthreadpool.c
@@ -30,9 +30,12 @@
 #include "viralloc.h"
 #include "virthread.h"
 #include "virerror.h"
+#include "virlog.h"

 #define VIR_FROM_THIS VIR_FROM_NONE

+VIR_LOG_INIT("util.threadpool");
+
 typedef struct _virThreadPoolJob virThreadPoolJob;
 typedef virThreadPoolJob *virThreadPoolJobPtr;

@@ -93,6 +96,24 @@ static inline bool virThreadPoolWorkerQuitHelper(size_t count, size_t limit)
     return count > limit;
 }

+
+static void
+virThreadPoolJobRemove(virThreadPoolPtr pool,
+                       virThreadPoolJobPtr job)
+{
+    if (job->prev)
+        job->prev->next = job->next;
+    else
+        pool->jobList.head = job->next;
+    if (job->next)
+        job->next->prev = job->prev;
+    else
+        pool->jobList.tail = job->prev;
+
+    pool->jobQueueDepth--;
+}
+
+
 static void virThreadPoolWorker(void *opaque)
 {
     struct virThreadPoolWorkerData *data = opaque;
@@ -152,16 +173,7 @@ static void virThreadPoolWorker(void *opaque)
             pool->jobList.firstPrio = tmp;
         }

-        if (job->prev)
-            job->prev->next = job->next;
-        else
-            pool->jobList.head = job->next;
-        if (job->next)
-            job->next->prev = job->prev;
-        else
-            pool->jobList.tail = job->prev;
-
-        pool->jobQueueDepth--;
+        virThreadPoolJobRemove(pool, job);

         virMutexUnlock(&pool->mutex);
         (pool->jobFunc)(job->data, pool->jobOpaque);
@@ -307,6 +319,38 @@ void virThreadPoolFree(virThreadPoolPtr pool)
 }


+/*
+ * virThreadPoolDrain:
+ * @pool: Pointer to thread pool
+ *
+ * Cause any pending job to be purged and notify the current workers
+ * of the impending quit.
+ */
+void
+virThreadPoolDrain(virThreadPoolPtr pool)
+{
+    virMutexLock(&pool->mutex);
+
+    VIR_DEBUG("nWorkers=%zd, nPrioWorkers=%zd jobQueueDepth=%zd",
+              pool->nWorkers, pool->nPrioWorkers, pool->jobQueueDepth);
+
+    while (pool->jobList.head != pool->jobList.tail) {
+        virThreadPoolJobPtr job = pool->jobList.head;
+
+        virThreadPoolJobRemove(pool, job);
+        VIR_FREE(job);
+    }
+
+    pool->quit = true;
+    if (pool->nWorkers > 0)
+        virCondBroadcast(&pool->cond);
+    if (pool->nPrioWorkers > 0)
+        virCondBroadcast(&pool->prioCond);
+
+    virMutexUnlock(&pool->mutex);
+}
+
+
 size_t virThreadPoolGetMinWorkers(virThreadPoolPtr pool)
 {
     size_t ret;
diff --git a/src/util/virthreadpool.h b/src/util/virthreadpool.h
index e1f362f5b..c54b166b1 100644
--- a/src/util/virthreadpool.h
+++ b/src/util/virthreadpool.h
@@ -52,6 +52,8 @@ size_t virThreadPoolGetJobQueueDepth(virThreadPoolPtr pool);

 void virThreadPoolFree(virThreadPoolPtr pool);

+void virThreadPoolDrain(virThreadPoolPtr pool);
+
 int virThreadPoolSendJob(virThreadPoolPtr pool,
                          unsigned int priority,
                          void *jobdata)
     ATTRIBUTE_NONNULL(1)
--
2.13.6
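For illustration, a self-contained pthreads sketch of the drain technique this patch implements: purge queued jobs under the pool lock, set the quit flag, and broadcast the condition variable so idle workers wake up and exit. Pool, Job, and poolDrain are hypothetical names; this is an assumed simplification, not libvirt's virMutex/virCond code (compile with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Job { struct Job *next; } Job;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    Job *head;
    int quit;
} Pool;

static void *workerMain(void *opaque)
{
    Pool *pool = opaque;

    pthread_mutex_lock(&pool->lock);
    while (!pool->quit) {
        while (!pool->head && !pool->quit)
            pthread_cond_wait(&pool->cond, &pool->lock);
        if (pool->quit)
            break;
        Job *job = pool->head;               /* dequeue one job */
        pool->head = job->next;
        pthread_mutex_unlock(&pool->lock);
        free(job);                           /* "run" the job */
        pthread_mutex_lock(&pool->lock);
    }
    pthread_mutex_unlock(&pool->lock);
    return NULL;
}

static void poolDrain(Pool *pool)
{
    pthread_mutex_lock(&pool->lock);
    while (pool->head) {                     /* purge anything still queued */
        Job *job = pool->head;
        pool->head = job->next;
        free(job);
    }
    pool->quit = 1;                          /* ask the workers to exit... */
    pthread_cond_broadcast(&pool->cond);     /* ...and wake the idle ones */
    pthread_mutex_unlock(&pool->lock);
}

int main(void)
{
    Pool pool = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0 };
    pthread_t th;

    pthread_create(&th, NULL, workerMain, &pool);
    poolDrain(&pool);
    pthread_join(th, NULL);
    printf("pool drained, worker exited\n");
    return 0;
}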

When virNetDaemonQuit is called, we need to let the NetServers know a quit
is pending and that a subsequent Close will tear down the environment.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/libvirt_remote.syms |  1 +
 src/rpc/virnetdaemon.c  | 13 +++++++++++++
 src/rpc/virnetserver.c  | 23 +++++++++++++++++++++++
 src/rpc/virnetserver.h  |  2 ++
 4 files changed, 39 insertions(+)

diff --git a/src/libvirt_remote.syms b/src/libvirt_remote.syms
index a181c4cf7..3170fbd7c 100644
--- a/src/libvirt_remote.syms
+++ b/src/libvirt_remote.syms
@@ -116,6 +116,7 @@ virNetServerNewPostExecRestart;
 virNetServerNextClientID;
 virNetServerPreExecRestart;
 virNetServerProcessClients;
+virNetServerQuitRequested;
 virNetServerSetClientAuthenticated;
 virNetServerStart;
 virNetServerUpdateServices;
diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
index 8c2141489..e5b376ced 100644
--- a/src/rpc/virnetdaemon.c
+++ b/src/rpc/virnetdaemon.c
@@ -850,6 +850,18 @@ virNetDaemonRun(virNetDaemonPtr dmn)
 }


+static int
+daemonServerQuitRequested(void *payload,
+                          const void *key ATTRIBUTE_UNUSED,
+                          void *opaque ATTRIBUTE_UNUSED)
+{
+    virNetServerPtr srv = payload;
+
+    virNetServerQuitRequested(srv);
+    return 0;
+}
+
+
 void
 virNetDaemonQuit(virNetDaemonPtr dmn)
 {
@@ -857,6 +869,7 @@ virNetDaemonQuit(virNetDaemonPtr dmn)
     VIR_DEBUG("Quit requested %p", dmn);

     dmn->quit = true;
+    virHashForEach(dmn->servers, daemonServerQuitRequested, NULL);

     virObjectUnlock(dmn);
 }
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c
index 7bab11efb..7114749ab 100644
--- a/src/rpc/virnetserver.c
+++ b/src/rpc/virnetserver.c
@@ -823,6 +823,29 @@ void virNetServerDispose(void *obj)
     virNetServerMDNSFree(srv->mdns);
 }

+
+/* virNetServerQuitRequested:
+ * @srv: Netserver pointer
+ *
+ * Disable new connections and drain anything waiting to run.
+ */
+void
+virNetServerQuitRequested(virNetServerPtr srv)
+{
+    size_t i;
+
+    if (!srv)
+        return;
+
+    VIR_DEBUG("Quit server requested '%s'", srv->name);
+
+    for (i = 0; i < srv->nservices; i++)
+        virNetServerServiceToggle(srv->services[i], false);
+
+    virThreadPoolDrain(srv->workers);
+}
+
+
 void virNetServerClose(virNetServerPtr srv)
 {
     size_t i;
diff --git a/src/rpc/virnetserver.h b/src/rpc/virnetserver.h
index 7728a67f5..a9bd3480b 100644
--- a/src/rpc/virnetserver.h
+++ b/src/rpc/virnetserver.h
@@ -57,6 +57,8 @@ virNetServerPtr virNetServerNewPostExecRestart(virJSONValuePtr object,
                                                virFreeCallback clientPrivFree,
                                                void *clientPrivOpaque);

+void virNetServerQuitRequested(virNetServerPtr srv);
+
 void virNetServerClose(virNetServerPtr srv);

 virJSONValuePtr virNetServerPreExecRestart(virNetServerPtr srv);
--
2.13.6

Get a count of the number of workers for all servers that are still
processing a job.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/libvirt_remote.syms |  1 +
 src/rpc/virnetserver.c  | 26 ++++++++++++++++++++++++++
 src/rpc/virnetserver.h  |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/src/libvirt_remote.syms b/src/libvirt_remote.syms
index 3170fbd7c..c6fd156ef 100644
--- a/src/libvirt_remote.syms
+++ b/src/libvirt_remote.syms
@@ -120,6 +120,7 @@ virNetServerQuitRequested;
 virNetServerSetClientAuthenticated;
 virNetServerStart;
 virNetServerUpdateServices;
+virNetServerWorkerCount;


 # rpc/virnetserverclient.h
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c
index 7114749ab..dc92dfea0 100644
--- a/src/rpc/virnetserver.c
+++ b/src/rpc/virnetserver.c
@@ -846,6 +846,32 @@ virNetServerQuitRequested(virNetServerPtr srv)
 }


+/* virNetServerWorkerCount:
+ * @srv: Netserver pointer
+ *
+ * Return the number of workers still waiting to be processed. This
+ * allows the quit requested code to wait for all worker jobs to be
+ * completed before actually quitting.
+ */
+size_t
+virNetServerWorkerCount(virNetServerPtr srv)
+{
+    size_t workerCount;
+
+    virObjectLock(srv);
+
+    workerCount = virThreadPoolGetCurrentWorkers(srv->workers) +
+                  virThreadPoolGetPriorityWorkers(srv->workers);
+
+    if (workerCount > 0)
+        VIR_DEBUG("server '%s' still has %zd workers", srv->name, workerCount);
+
+    virObjectUnlock(srv);
+
+    return workerCount;
+}
+
+
 void virNetServerClose(virNetServerPtr srv)
 {
     size_t i;
diff --git a/src/rpc/virnetserver.h b/src/rpc/virnetserver.h
index a9bd3480b..1b1916814 100644
--- a/src/rpc/virnetserver.h
+++ b/src/rpc/virnetserver.h
@@ -59,6 +59,8 @@ virNetServerPtr virNetServerNewPostExecRestart(virJSONValuePtr object,

 void virNetServerQuitRequested(virNetServerPtr srv);

+size_t virNetServerWorkerCount(virNetServerPtr srv);
+
 void virNetServerClose(virNetServerPtr srv);

 virJSONValuePtr virNetServerPreExecRestart(virNetServerPtr srv);
--
2.13.6

When virNetDaemonQuit is called from libvirtd's shutdown handler
(daemonShutdownHandler), we need to perform the quit in multiple steps.
The first part is to "request" the quit and notify the NetServers of the
impending quit, which causes the NetServers to drain their pending queue
and tell their workers to quit; the second is to wait for any currently
running worker jobs to complete. Once the workers are complete, then we
can cause the quit to occur.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/rpc/virnetdaemon.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
index e5b376ced..6aee712a5 100644
--- a/src/rpc/virnetdaemon.c
+++ b/src/rpc/virnetdaemon.c
@@ -73,6 +73,7 @@ struct _virNetDaemon {
     virHashTablePtr servers;
     virJSONValuePtr srvObject;

+    bool quitRequested;
     bool quit;

     unsigned int autoShutdownTimeout;
@@ -779,6 +780,32 @@ daemonServerProcessClients(void *payload,
     return 0;
 }

+
+static int
+daemonServerWorkerCount(void *payload,
+                        const void *key ATTRIBUTE_UNUSED,
+                        void *opaque)
+{
+    size_t *workerCount = opaque;
+    virNetServerPtr srv = payload;
+
+    *workerCount += virNetServerWorkerCount(srv);
+
+    return 0;
+}
+
+
+static bool
+daemonServerWorkersDone(virNetDaemonPtr dmn)
+{
+    size_t workerCount = 0;
+
+    virHashForEach(dmn->servers, daemonServerWorkerCount, &workerCount);
+
+    return workerCount == 0;
+}
+
+
 void
 virNetDaemonRun(virNetDaemonPtr dmn)
 {
@@ -843,6 +870,9 @@ virNetDaemonRun(virNetDaemonPtr dmn)
         virObjectLock(dmn);

         virHashForEach(dmn->servers, daemonServerProcessClients, NULL);
+
+        if (dmn->quitRequested && daemonServerWorkersDone(dmn))
+            dmn->quit = true;
     }

 cleanup:
@@ -868,7 +898,7 @@ virNetDaemonQuit(virNetDaemonPtr dmn)
     virObjectLock(dmn);

     VIR_DEBUG("Quit requested %p", dmn);
-    dmn->quit = true;
+    dmn->quitRequested = true;
     virHashForEach(dmn->servers, daemonServerQuitRequested, NULL);

     virObjectUnlock(dmn);
--
2.13.6

On 01/19/2018 12:23 PM, John Ferlan wrote:
When virNetDaemonQuit is called from libvirtd's shutdown handler (daemonShutdownHandler), we need to perform the quit in multiple steps. The first part is to "request" the quit and notify the NetServers of the impending quit, which causes the NetServers to drain their pending queue and tell their workers to quit; the second is to wait for any currently running worker jobs to complete. Once the workers are complete, then we can cause the quit to occur.
Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/rpc/virnetdaemon.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)
sigh - too quick with the send; I neglected to run the "so what happens when I have nothing pending" check/test... Attached is a patch that would be merged with this one.

John
diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
index e5b376ced..6aee712a5 100644
--- a/src/rpc/virnetdaemon.c
+++ b/src/rpc/virnetdaemon.c
@@ -73,6 +73,7 @@ struct _virNetDaemon {
     virHashTablePtr servers;
     virJSONValuePtr srvObject;

+    bool quitRequested;
     bool quit;

     unsigned int autoShutdownTimeout;
@@ -779,6 +780,32 @@ daemonServerProcessClients(void *payload,
     return 0;
 }

+
+static int
+daemonServerWorkerCount(void *payload,
+                        const void *key ATTRIBUTE_UNUSED,
+                        void *opaque)
+{
+    size_t *workerCount = opaque;
+    virNetServerPtr srv = payload;
+
+    *workerCount += virNetServerWorkerCount(srv);
+
+    return 0;
+}
+
+
+static bool
+daemonServerWorkersDone(virNetDaemonPtr dmn)
+{
+    size_t workerCount = 0;
+
+    virHashForEach(dmn->servers, daemonServerWorkerCount, &workerCount);
+
+    return workerCount == 0;
+}
+
+
 void
 virNetDaemonRun(virNetDaemonPtr dmn)
 {
@@ -843,6 +870,9 @@ virNetDaemonRun(virNetDaemonPtr dmn)
         virObjectLock(dmn);

         virHashForEach(dmn->servers, daemonServerProcessClients, NULL);
+
+        if (dmn->quitRequested && daemonServerWorkersDone(dmn))
+            dmn->quit = true;
     }

 cleanup:
@@ -868,7 +898,7 @@ virNetDaemonQuit(virNetDaemonPtr dmn)
     virObjectLock(dmn);

     VIR_DEBUG("Quit requested %p", dmn);
-    dmn->quit = true;
+    dmn->quitRequested = true;
     virHashForEach(dmn->servers, daemonServerQuitRequested, NULL);

     virObjectUnlock(dmn);

On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
When virNetDaemonQuit is called from libvirtd's shutdown handler (daemonShutdownHandler), we need to perform the quit in multiple steps. The first part is to "request" the quit and notify the NetServers of the impending quit, which causes the NetServers to drain their pending queue and tell their workers to quit; the second is to wait for any currently running worker jobs to complete. Once the workers are complete, then we can cause the quit to occur.
Instead of “polling” for 'daemonServerWorkersDone' we could use a pipe for signaling. This would also ensure that we do not remain in the poll() call indefinitely, and it avoids the timer hack.

[…snip]

--
Beste Grüße / Kind regards
Marc Hartmayer
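For illustration, a self-contained sketch of the pipe-based wakeup suggested here (the classic self-pipe trick; an assumed illustration with hypothetical names, not libvirt code; compile with -pthread). The finishing worker writes a byte, so the event loop's poll() returns immediately instead of waiting for a timeout:

#include <pthread.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

static int wakeupPipe[2];

static void *worker(void *arg)
{
    char byte = 0;

    (void)arg;
    sleep(1);                               /* stand-in for a long job */
    if (write(wakeupPipe[1], &byte, 1) < 0) /* signal: workers are done */
        perror("write");
    return NULL;
}

int main(void)
{
    pthread_t th;
    struct pollfd pfd;
    char byte;

    if (pipe(wakeupPipe) < 0)
        return 1;
    pthread_create(&th, NULL, worker, NULL);

    pfd.fd = wakeupPipe[0];
    pfd.events = POLLIN;
    poll(&pfd, 1, -1);                      /* blocks until the worker writes */

    if (read(wakeupPipe[0], &byte, 1) < 0)
        perror("read");
    pthread_join(th, NULL);
    printf("event loop woken by worker completion, quitting\n");
    return 0;
}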

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 docs/news.xml | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/docs/news.xml b/docs/news.xml
index b4d980624..18800119d 100644
--- a/docs/news.xml
+++ b/docs/news.xml
@@ -39,6 +39,18 @@
     <section title="Improvements">
     </section>
     <section title="Bug fixes">
+      <change>
+        <summary>
+          Fix issues with unexpected libvirtd termination
+        </summary>
+        <description>
+          If there were outstanding worker threads when libvirtd
+          received a quit signal via SIGTERM, SIGINT, or SIGQUIT,
+          then libvirtd may have crashed or hung waiting for a
+          worker job to send a completion notification that cannot
+          be received because the event loop was stopped too soon.
+        </description>
+      </change>
     </section>
   </release>
   <release version="v4.0.0" date="2018-01-19">
--
2.13.6

Modify GetAllStats to generate a long enough pause in order to send a
SIGTERM to libvirtd while a client connection is processing.

In order to "set things up":

1. In one terminal window, from a local branch built using these patches,
   using a root account, run libvirtd under the debugger, e.g.:

     # ./run gdb daemon/libvirtd

   Once running, type a 'c' (e.g. continue) and <return>.

2. Start a domain (or have one running with the current libvirtd):

     virsh start $domain

3. Prepare a domstats command for that domain (but don't yet hit <return>
   in order to run it):

     virsh domstats $domain

4. Prepare a kill command for the running libvirtd, e.g.:

     jferlan   4143  4107  0 09:51 pts/1   00:00:00 vim +1396 daemon/libvirtd.c
     root     30054 21195  6 11:17 pts/8   00:00:01 gdb /home/jferlan/git/libvirt.work/daemon/.libs/lt-libvirtd
     root     30087 30054  7 11:17 pts/8   00:00:01 /home/jferlan/git/libvirt.work/daemon/.libs/lt-libvirtd
     root     30385 19861  0 11:17 pts/17  00:00:00 grep --color=auto libvirtd

   but again don't hit <return> yet.

5. Align your stars perfectly now...
   a. Hit <return> on your domstats command.
   b. Swap to the kill command window and hit <return>.

This should cause the libvirtd debug window to stop, but since you already
typed 'c' it'll continue at least briefly, for example:

  ...
  [Thread 0x7fffc3231700 (LWP 30374) exited]
  Detaching after fork from child process 30376.
  Detaching after fork from child process 30377.
  Detaching after fork from child process 30378.
  [Thread 0x7fffc4445700 (LWP 30106) exited]
  c
  2018-01-10 16:18:12.962+0000: 30094: info : libvirt version: 4.0.0
  2018-01-10 16:18:12.962+0000: 30094: info : hostname: unknown4ceb42c824f4.attlocal.net
  2018-01-10 16:18:12.962+0000: 30094: warning : qemuConnectGetAllDomainStats:20265 : k = -5340232226128654848

  Thread 1 "lt-libvirtd" received signal SIGTERM, Terminated.
  0x00007ffff3ae6d2d in poll () from /lib64/libc.so.6
  ...
  (gdb) c
  Continuing.
  [Thread 0x7fffc5c48700 (LWP 30103) exited]
  [Thread 0x7fffc5447700 (LWP 30104) exited]
  [Thread 0x7fffc4c46700 (LWP 30105) exited]
  [Thread 0x7fffc6449700 (LWP 30102) exited]
  [Thread 0x7fffc6c4a700 (LWP 30101) exited]
  [Thread 0x7fffe3b57700 (LWP 30097) exited]
  [Thread 0x7fffe4358700 (LWP 30096) exited]
  [Thread 0x7fffe2354700 (LWP 30100) exited]
  [Thread 0x7fffe3356700 (LWP 30098) exited]
  [Thread 0x7fffe2b55700 (LWP 30099) exited]
  [Thread 0x7fffe535a700 (LWP 30094) exited]
  [Thread 0x7fffe5b5b700 (LWP 30093) exited]
  [Thread 0x7fffe635c700 (LWP 30092) exited]
  [Thread 0x7fffe6b5d700 (LWP 30091) exited]
  2018-01-10 16:18:25.451+0000: 30095: warning : qemuConnectGetAllDomainStats:20265 : k = -5340232226128654848
  [Thread 0x7fffe4b59700 (LWP 30095) exited]
  [Thread 0x7fffc3a32700 (LWP 30187) exited]
  [Inferior 1 (process 30087) exited normally]
  (gdb) c
  The program is not being run.
  (gdb) quit

The virsh domstats window will "close" as follows:

  error: Disconnected from qemu:///system due to end of file
  error: End of file while reading data: Input/output error

If something's wrong, then the libvirtd window may not exit those final two
threads, in which case you could interrupt it (^C) and check the threads
(thread apply all bt), which will probably show some sort of hang...

My testing shows that the hang no longer occurs with all the previous
patches applied.
The subsequent patch calling virHashRemoveAll from virNetDaemonClose does
not seem to be necessary, although I suppose it cannot hurt, as the same
essential functionality occurs during the Dispose function.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/qemu/qemu_driver.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index a203c9297..d1be8048f 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -20184,6 +20184,7 @@ qemuConnectGetAllDomainStats(virConnectPtr conn,
     bool enforce = !!(flags & VIR_CONNECT_GET_ALL_DOMAINS_STATS_ENFORCE_STATS);
     int nstats = 0;
     size_t i;
+    size_t j, k = 0;
     int ret = -1;
     unsigned int privflags = 0;
     unsigned int domflags = 0;
@@ -20221,6 +20222,10 @@ qemuConnectGetAllDomainStats(virConnectPtr conn,
     if (qemuDomainGetStatsNeedMonitor(stats))
         privflags |= QEMU_DOMAIN_STATS_HAVE_JOB;

+    for (j = 0; j < 10000000000; j++) // Add one more zero for longer...
+        k = j + k;
+    VIR_WARN("k = %zd", k);
+
     for (i = 0; i < nvms; i++) {
         virDomainStatsRecordPtr tmp = NULL;
         domflags = 0;
--
2.13.6

Hi, John. Thank you for driving this issue! Unfortunately I have little time now to take part in reviewing etc.

Nikolay

On 19.01.2018 20:23, John Ferlan wrote:
RFC: https://www.redhat.com/archives/libvir-list/2018-January/msg00318.html
Adjustments since RFC...
Patches 1&2: No change, were already R-B'd
Patch 3: Removed code as noted in code review, update commit message
Patch 4: From old series removed, see below for more details
Patch 9: no change

NB: Patches 5-8 and 10 from Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
are removed as they seemed not to be necessary.
Replaced the former patch 4 with a series of patches to (slowly) provide support to disable new connections, remove waiting jobs, cause the waiting workers to quit, and allow any running jobs to complete.
As it turns out, waiting for running jobs to complete cannot be done from the virNetServerClose callbacks, because that means the event loop processing done during virNetDaemonRun will not allow any currently long running worker job thread a means to complete.
So when virNetDaemonQuit is called as a result of the libvirtd signal handlers for SIG{QUIT|INT|TERM}, instead of just causing virNetDaemonRun to quit immediately, alter it to use a quitRequested flag, and then use that flag to check for long running worker threads before causing the event loop to quit, so that libvirtd is able to run through the virNetDaemonClose processing.
John Ferlan (9):
  libvirtd: Alter refcnt processing for domain server objects
  libvirtd: Alter refcnt processing for server program objects
  netserver: Remove ServiceToggle during ServerDispose
  util: Introduce virThreadPoolDrain
  rpc: Introduce virNetServerQuitRequested
  rpc: Introduce virNetServerWorkerCount
  rpc: Alter virNetDaemonQuit processing
  docs: Add news article for libvirtd issue
  APPLY ONLY FOR TESTING PURPOSES
 daemon/libvirtd.c        | 43 +++++++++++++++++++++++---------
 docs/news.xml            | 12 +++++++++
 src/libvirt_private.syms |  1 +
 src/libvirt_remote.syms  |  2 ++
 src/qemu/qemu_driver.c   |  5 ++++
 src/rpc/virnetdaemon.c   | 45 +++++++++++++++++++++++++++++++++-
 src/rpc/virnetserver.c   | 52 ++++++++++++++++++++++++++++++++++++---
 src/rpc/virnetserver.h   |  4 +++
 src/util/virthreadpool.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 src/util/virthreadpool.h |  2 ++
 10 files changed, 204 insertions(+), 26 deletions(-)

On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
RFC: https://www.redhat.com/archives/libvir-list/2018-January/msg00318.html
Adjustments since RFC...
Patches 1&2: No change, were already R-B'd
Patch 3: Removed code as noted in code review, update commit message
Patch 4: From old series removed, see below for more details
Patch 9: no change

NB: Patches 5-8 and 10 from Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
are removed as they seemed not to be necessary.
Replaced the former patch 4 with a series of patches to (slowly) provide support to disable new connections, remove waiting jobs, cause the waiting workers to quit, and allow any running jobs to complete.
As it turns out, waiting for running jobs to complete cannot be done from the virNetServerClose callbacks, because that means the event loop processing done during virNetDaemonRun will not allow any currently long running worker job thread a means to complete.
So when virNetDaemonQuit is called as a result of the libvirtd signal handlers for SIG{QUIT|INT|TERM}, instead of just causing virNetDaemonRun to quit immediately, alter it to use a quitRequested flag, and then use that flag to check for long running worker threads before causing the event loop to quit, so that libvirtd is able to run through the virNetDaemonClose processing.
Gave a quick test:

+ Didn't get a segmentation fault at the end of libvirtd (at least for my
  quick test)

- A single SIGTERM doesn’t always lead to the termination of libvirtd now
  (debugged it: the main thread is waiting for poll()). This behavior can
  be easily reproduced: start libvirtd on the CLI, wait some seconds for
  the first initialization -> CTRL + C -> libvirtd doesn’t terminate, but
  also doesn’t accept new connections.
John Ferlan (9):
  libvirtd: Alter refcnt processing for domain server objects
  libvirtd: Alter refcnt processing for server program objects
  netserver: Remove ServiceToggle during ServerDispose
  util: Introduce virThreadPoolDrain
  rpc: Introduce virNetServerQuitRequested
  rpc: Introduce virNetServerWorkerCount
  rpc: Alter virNetDaemonQuit processing
  docs: Add news article for libvirtd issue
  APPLY ONLY FOR TESTING PURPOSES
 daemon/libvirtd.c        | 43 +++++++++++++++++++++++---------
 docs/news.xml            | 12 +++++++++
 src/libvirt_private.syms |  1 +
 src/libvirt_remote.syms  |  2 ++
 src/qemu/qemu_driver.c   |  5 ++++
 src/rpc/virnetdaemon.c   | 45 +++++++++++++++++++++++++++++++++-
 src/rpc/virnetserver.c   | 52 ++++++++++++++++++++++++++++++++++++---
 src/rpc/virnetserver.h   |  4 +++
 src/util/virthreadpool.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 src/util/virthreadpool.h |  2 ++
 10 files changed, 204 insertions(+), 26 deletions(-)
-- 2.13.6
--
Beste Grüße / Kind regards
Marc Hartmayer

On 01/23/2018 04:21 AM, Marc Hartmayer wrote:
On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
RFC: https://www.redhat.com/archives/libvir-list/2018-January/msg00318.html
Adjustments since RFC...
Patches 1&2: No change, were already R-B'd
Patch 3: Removed code as noted in code review, update commit message
Patch 4: From old series removed, see below for more details
Patch 9: no change

NB: Patches 5-8 and 10 from Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
are removed as they seemed not to be necessary.
Replaced the former patch 4 with a series of patches to (slowly) provide support to disable new connections, remove waiting jobs, cause the waiting workers to quit, and allow any running jobs to complete.
As it turns out, waiting for running jobs to complete cannot be done from the virNetServerClose callbacks, because that means the event loop processing done during virNetDaemonRun will not allow any currently long running worker job thread a means to complete.
So when virNetDaemonQuit is called as a result of the libvirtd signal handlers for SIG{QUIT|INT|TERM}, instead of just causing virNetDaemonRun to quit immediately, alter it to use a quitRequested flag, and then use that flag to check for long running worker threads before causing the event loop to quit, so that libvirtd is able to run through the virNetDaemonClose processing.
Gave a quick test:

+ Didn't get a segmentation fault at the end of libvirtd (at least for my
  quick test)

- A single SIGTERM doesn’t always lead to the termination of libvirtd now
  (debugged it: the main thread is waiting for poll()). This behavior can
  be easily reproduced: start libvirtd on the CLI, wait some seconds for
  the first initialization -> CTRL + C -> libvirtd doesn’t terminate, but
  also doesn’t accept new connections.
Does this include the "update" in my response to patch 7? It should be extract-able and apply-able. John
John Ferlan (9):
  libvirtd: Alter refcnt processing for domain server objects
  libvirtd: Alter refcnt processing for server program objects
  netserver: Remove ServiceToggle during ServerDispose
  util: Introduce virThreadPoolDrain
  rpc: Introduce virNetServerQuitRequested
  rpc: Introduce virNetServerWorkerCount
  rpc: Alter virNetDaemonQuit processing
  docs: Add news article for libvirtd issue
  APPLY ONLY FOR TESTING PURPOSES
 daemon/libvirtd.c        | 43 +++++++++++++++++++++++---------
 docs/news.xml            | 12 +++++++++
 src/libvirt_private.syms |  1 +
 src/libvirt_remote.syms  |  2 ++
 src/qemu/qemu_driver.c   |  5 ++++
 src/rpc/virnetdaemon.c   | 45 +++++++++++++++++++++++++++++++++-
 src/rpc/virnetserver.c   | 52 ++++++++++++++++++++++++++++++++++++---
 src/rpc/virnetserver.h   |  4 +++
 src/util/virthreadpool.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 src/util/virthreadpool.h |  2 ++
 10 files changed, 204 insertions(+), 26 deletions(-)
-- 2.13.6
--
Beste Grüße / Kind regards
Marc Hartmayer

On Tue, Jan 23, 2018 at 04:01 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
On 01/23/2018 04:21 AM, Marc Hartmayer wrote:
On Fri, Jan 19, 2018 at 06:23 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
RFC: https://www.redhat.com/archives/libvir-list/2018-January/msg00318.html
Adjustments since RFC...
Patches 1&2: No change, were already R-B'd
Patch 3: Removed code as noted in code review, update commit message
Patch 4: From old series removed, see below for more details
Patch 9: no change

NB: Patches 5-8 and 10 from Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
are removed as they seemed not to be necessary.
Replaced the former patch 4 with a series of patches to (slowly) provide support to disable new connections, remove waiting jobs, cause the waiting workers to quit, and allow any running jobs to complete.
As it turns out, waiting for running jobs to complete cannot be done from the virNetServerClose callbacks, because that means the event loop processing done during virNetDaemonRun will not allow any currently long running worker job thread a means to complete.
So when virNetDaemonQuit is called as a result of the libvirtd signal handlers for SIG{QUIT|INT|TERM}, instead of just causing virNetDaemonRun to quit immediately, alter it to use a quitRequested flag, and then use that flag to check for long running worker threads before causing the event loop to quit, so that libvirtd is able to run through the virNetDaemonClose processing.
Gave a quick test:

+ Didn't get a segmentation fault at the end of libvirtd (at least for my
  quick test)

- A single SIGTERM doesn’t always lead to the termination of libvirtd now
  (debugged it: the main thread is waiting for poll()). This behavior can
  be easily reproduced: start libvirtd on the CLI, wait some seconds for
  the first initialization -> CTRL + C -> libvirtd doesn’t terminate, but
  also doesn’t accept new connections.
Does this include the "update" in my response to patch 7? It should be extract-able and apply-able.
No, I forgot this one :( I’ll try it tomorrow.
John
John Ferlan (9):
  libvirtd: Alter refcnt processing for domain server objects
  libvirtd: Alter refcnt processing for server program objects
  netserver: Remove ServiceToggle during ServerDispose
  util: Introduce virThreadPoolDrain
  rpc: Introduce virNetServerQuitRequested
  rpc: Introduce virNetServerWorkerCount
  rpc: Alter virNetDaemonQuit processing
  docs: Add news article for libvirtd issue
  APPLY ONLY FOR TESTING PURPOSES
 daemon/libvirtd.c        | 43 +++++++++++++++++++++++---------
 docs/news.xml            | 12 +++++++++
 src/libvirt_private.syms |  1 +
 src/libvirt_remote.syms  |  2 ++
 src/qemu/qemu_driver.c   |  5 ++++
 src/rpc/virnetdaemon.c   | 45 +++++++++++++++++++++++++++++++++-
 src/rpc/virnetserver.c   | 52 ++++++++++++++++++++++++++++++++++++---
 src/rpc/virnetserver.h   |  4 +++
 src/util/virthreadpool.c | 64 ++++++++++++++++++++++++++++++++++++++++--------
 src/util/virthreadpool.h |  2 ++
 10 files changed, 204 insertions(+), 26 deletions(-)
-- 2.13.6
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
-- Beste Grüße / Kind regards Marc Hartmayer
IBM Deutschland Research & Development GmbH Vorsitzende des Aufsichtsrats: Martina Koederitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294
-- Beste Grüße / Kind regards Marc Hartmayer IBM Deutschland Research & Development GmbH Vorsitzende des Aufsichtsrats: Martina Koederitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

On Tue, Jan 23, 2018 at 05:41 PM +0100, Marc Hartmayer <mhartmay@linux.vnet.ibm.com> wrote:
On Tue, Jan 23, 2018 at 04:01 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
[…snip]
Does this include the "update" in my response to patch 7? It should be extract-able and apply-able.
No, I forgot this one :( I’ll try it tomorrow.
It works with the new patch 7. […snip]

On 19.01.2018 20:23, John Ferlan wrote:
[…snip]
As it turns out, waiting for running jobs to complete cannot be done from the virNetServerClose callbacks because that means the event loop processing done during virNetServerRun will not allow any currently long running worker job thread a means to complete.
Hi, John.

So you suggest a different approach. Instead of introducing means to help rpc threads to finish after the event loop is finished (the stateShutdown hook in my series), you suggest blocking further new rpc processing and then waiting for running rpc calls to finish, keeping the event loop running for that purpose.

This could lead to libvirtd never finishing its termination if, for example, one of the qemu processes does not respond. In the case of a long running operation such as migration, finishing can take much time. On the other hand, if we finish rpc threads abruptly, as in the case of the stateShutdown hook, we have to deal with all possible error paths on daemon finishing. Maybe a good approach is to abort the long running operation, then give rpc threads some time to finish as you suggest, and only after that finish them abruptly to handle a hanging qemu etc.

Also, in this approach we keep the event loop running for rpc threads only, but there can be other threads that depend on the loop. For example, if qemu is rebooted we spawn a thread that executes qemuProcessFakeReboot, which sends commands to qemu (i.e. depends on the event loop). We need to await such threads finishing too while keeping the event loop running. In the stateShutdown hook approach we finish all threads that use the qemu monitor by closing the monitor.

And the hack with the timeout in a loop... I think of a different approach for waiting for rpc threads to finish their work. First, let's make the drain function only flush the job queue and take some callback which will be called when all workers are free from work (let's keep the worker threads, as Dan suggested). In this callback we can use the same technique as in virNetDaemonSignalHandler: do some IO to set a flag in the event loop thread, cause virEventRunDefaultImpl to finish, and then test this flag in virNetDaemonRun.

Nikolay
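The self-pipe technique Nikolay describes here, the same trick virNetDaemonSignalHandler uses, would look roughly like the following minimal sketch. All names (wakeupPipe, onWorkersDrained, wakeupHandler, daemonRunLoop) are hypothetical; only virEventRunDefaultImpl and the virEventAddHandle callback signature are taken from the real libvirt API.

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>
#include <libvirt/libvirt.h>

/* Hypothetical sketch of the self-pipe wakeup described above. */
static int wakeupPipe[2];   /* created with pipe() at daemon startup */
static bool quitFlag;       /* written only from the event loop thread */

/* Thread pool callback: invoked once the last busy worker finishes. */
static void onWorkersDrained(void *opaque)
{
    char c = 'q';
    while (write(wakeupPipe[1], &c, 1) < 0 && errno == EINTR)
        ;   /* the write makes the poll() inside the event loop return */
}

/* Handler registered via virEventAddHandle() on wakeupPipe[0]; it runs
 * in the event loop thread, so setting the flag needs no locking. */
static void wakeupHandler(int watch, int fd, int events, void *opaque)
{
    char c;
    while (read(fd, &c, 1) < 0 && errno == EINTR)
        ;
    quitFlag = true;
}

/* virNetDaemonRun-shaped loop: the flag is tested between iterations. */
static void daemonRunLoop(void)
{
    while (!quitFlag)
        virEventRunDefaultImpl();
}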

On Fri, Jan 26, 2018 at 11:47:04AM +0300, Nikolay Shirokovskiy wrote:
[…snip]
This could lead to libvirtd never finishing its termination if, for example, one of the qemu processes does not respond. In the case of a long running operation such as migration, finishing can take much time. On the other hand, if we finish rpc threads abruptly, as in the case of the stateShutdown hook, we have to deal with all possible error paths on daemon finishing. Maybe a good approach is to abort the long running operation, then give rpc threads some time to finish as you suggest, and only after that finish them abruptly to handle a hanging qemu etc.
We should keep in mind why we are bothering to do any of this "graceful" cleanup. 99% of the time it would be fine for libvirtd to just immediately exit and not run any cleanup code, since the OS cleans everything up regardless. Really the only benefit of running cleanup is so that developers can check for memory leaks across the process execution.

On balance it is *way* more important that libvirtd is guaranteed to exit in a short, finite amount of time, even if that means skipping cleanup, because that is what is relevant to production deployment.

So this means that while it is nice to wait for worker threads to complete their current RPC operation, we should not wait too long. E.g. give them 15 seconds to finish their work, and if they're not done then just unconditionally exit libvirtd, possibly skipping remaining cleanup if that's necessary to avoid a SEGV.

Regards, Daniel
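Daniel's bounded shutdown could be wired up with the public event loop API roughly as follows. This is a hypothetical sketch: virEventAddTimeout is real API, but SHUTDOWN_GRACE_MS, shutdownFinished, quitTimerFired, and requestQuit are invented names.

#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>
#include <libvirt/libvirt.h>

#define SHUTDOWN_GRACE_MS (15 * 1000)   /* the ~15 seconds suggested above */

static bool shutdownFinished;   /* set once graceful cleanup completes */

/* One-shot grace timer: if draining is still in progress when it
 * fires, stop being nice and terminate without further cleanup. */
static void quitTimerFired(int timer, void *opaque)
{
    if (!shutdownFinished)
        _exit(EXIT_FAILURE);    /* skip cleanup rather than hang forever */
}

/* Called from the SIGTERM/SIGINT/SIGQUIT handling path. */
static void requestQuit(void)
{
    /* ... stop accepting connections, start draining workers ... */
    virEventAddTimeout(SHUTDOWN_GRACE_MS, quitTimerFired, NULL, NULL);
}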

On 01/26/2018 03:47 AM, Nikolay Shirokovskiy wrote:
[…snip]
Hi, John.
Sorry - been busy in other areas lately - trying to get back to this...
So you suggest a different approach. Instead of introducing means to help rpc threads to finish after the event loop is finished (the stateShutdown hook in my series), you suggest blocking further new rpc processing and then waiting for running rpc calls to finish, keeping the event loop running for that purpose.
Somewhere along the way in one of the reviews it was suggested to give the workers perhaps a few extra cycles to complete anything they might be doing, and a mechanism to do that was suggested as well - which is what I tried to accomplish here. Your mechanism, by contrast, was more of an "immediate separation": IIRC, closing the monitor connection and letting that close handler do whatever magic happens there. This is not to say your mechanism is the wrong way to go about it, but it also hasn't garnered support, nor have you actively attempted to champion it. So I presented an alternative approach.
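The rough shape of that mechanism, going by the cover letter's description, is the following hypothetical reconstruction (not the actual patch; the struct fields and the busyWorkerCount helper are invented, standing in for virNetServerWorkerCount summed over the daemon's servers):

#include <stdbool.h>
#include <stddef.h>
#include <libvirt/libvirt.h>

/* Hypothetical reconstruction of the altered run loop: on a quit
 * request the services stop accepting new work, but the event loop
 * keeps running until no worker thread is still executing a job. */
struct daemon {
    bool quitRequested;   /* set by the signal handlers */
    bool quit;            /* makes the loop below fall through */
};

/* Invented helper standing in for virNetServerWorkerCount() summed
 * over every server in the daemon's hash table. */
extern size_t busyWorkerCount(struct daemon *dmn);

static void daemonRun(struct daemon *dmn)
{
    while (!dmn->quit) {
        virEventRunDefaultImpl();
        if (dmn->quitRequested && busyWorkerCount(dmn) == 0)
            dmn->quit = true;   /* safe now to run Close/Dispose */
    }
}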
This could lead to libvirtd never finishing its termination if, for example, one of the qemu processes does not respond. In the case of a long running operation such as migration, finishing can take much time. On the other hand, if we finish rpc threads abruptly, as in the case of the stateShutdown hook, we have to deal with all possible error paths on daemon finishing. Maybe a good approach is to abort the long running operation, then give rpc threads some time to finish as you suggest, and only after that finish them abruptly to handle a hanging qemu etc.
True, but it's also easy enough to extend the last (updated) patch to add the QuitTimer and force an exit without going through the rest of the "friendly" quit (IOW: more or less what Dan has suggested). There's an incredible number of cycles being spent on what amounts to trying to be nice about a SIGTERM, SIGQUIT, or SIGINT - IOW: eventual death.
[…snip]
Please feel free to spend as many cycles as you can making adjustments to what I have. I'm working through some alterations and posting a v2 of what I have and we'll see where it takes me/us. John

On Wed, Feb 14, 2018 at 11:36 PM +0100, John Ferlan <jferlan@redhat.com> wrote:
[…snip]
Is there any update so far? I’m asking because I’m still getting segmentation faults and hang-ups on termination of libvirtd (using the newest version of libvirt).

Example for a hang-up:

➤ bt
#0  0x000003fffca8df84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000003fffdac29ca in virCondWait (c=<optimized out>, m=<optimized out>) at ../../src/util/virthread.c:154
#2  0x000003fffdac381c in virThreadPoolFree (pool=<optimized out>) at ../../src/util/virthreadpool.c:290
#3  0x000003fffdbb21ae in virNetServerDispose (obj=0x1000cc640) at ../../src/rpc/virnetserver.c:803
#4  0x000003fffda97286 in virObjectUnref (anyobj=<optimized out>) at ../../src/util/virobject.c:350
#5  0x000003fffda97a5a in virObjectFreeHashData (opaque=<optimized out>, name=<optimized out>) at ../../src/util/virobject.c:591
#6  0x000003fffda66576 in virHashFree (table=<optimized out>) at ../../src/util/virhash.c:305
#7  0x000003fffdbaff82 in virNetDaemonDispose (obj=0x1000cc3c0) at ../../src/rpc/virnetdaemon.c:105
#8  0x000003fffda97286 in virObjectUnref (anyobj=<optimized out>) at ../../src/util/virobject.c:350
#9  0x0000000100026cd6 in main (argc=<optimized out>, argv=<optimized out>) at ../../src/remote/remote_daemon.c:1487

And segmentation faults happen for RPC jobs that are still running.

[…snip]
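The wait in frame #2 corresponds to a teardown pattern like the simplified sketch below (field names are approximations, not the exact libvirt source): the pool wakes its idle workers and then blocks until every worker has exited, so a worker stuck inside a long-running RPC job blocks teardown indefinitely, which is exactly the hang shown above.

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Approximate sketch of the thread pool teardown behind frame #2. */
struct pool {
    pthread_mutex_t mutex;
    pthread_cond_t cond;        /* idle workers sleep here awaiting jobs */
    pthread_cond_t quit_cond;   /* teardown sleeps here awaiting exits   */
    size_t nWorkers;            /* decremented by each worker on exit    */
    bool quit;
};

void poolFree(struct pool *p)
{
    pthread_mutex_lock(&p->mutex);
    p->quit = true;
    pthread_cond_broadcast(&p->cond);   /* wakes only *idle* workers */
    while (p->nWorkers > 0)             /* a worker stuck in a long job */
        pthread_cond_wait(&p->quit_cond, &p->mutex);   /* never signals */
    pthread_mutex_unlock(&p->mutex);
    free(p);
}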

Is there any update so far? I’m asking because I’m still getting segmentation faults and hang-ups on termination of libvirtd (using the newest version of libvirt).
[…snip]
And segmentation faults happen for RPC jobs that are still running.
There has been zero of my cycles spent thinking about this. Partially because I'm busy in other areas, partially because I know Daniel is planning changes in libvirtd (https://www.redhat.com/archives/libvir-list/2018-May/msg01307.html), and partially because I'm not sure I have a {reliable|simple} reproducer (at least I don't recall). I do still have various branches in various states of disarray that are way behind current head (easy to happen it seems lately). John

On Tue, Jul 03, 2018 at 09:21 PM +0200, John Ferlan <jferlan@redhat.com> wrote:
[…snip]
There has been zero of my cycles spent thinking about this. Partially because I'm busy in other areas, partially because I know Daniel is planning changes in libvirtd (https://www.redhat.com/archives/libvir-list/2018-May/msg01307.html), and partially because I'm not sure I have a {reliable|simple} reproducer (at least I don't recall).
I do a simple start/destroy loop and send a SIGTERM to libvirtd. This leads in almost every case to a segmentation fault.

On 07/06/2018 05:15 AM, Marc Hartmayer wrote:
[…snip]
I found/updated my branches and started recalling where things were left off. I had R-B's for patches 1 & 2 until you noted the issue with the global variables, which was fixed in my branch. Since there's been a lot of code motion since then and what I have now is different, I figure I should repost the series and go from there.

I haven't addressed the use-a-pipe-instead-of-polling comment you had in patch 7. I haven't put much thought into it, but in general, given how things work, how would you expect pipe signaling to work in this model? Essentially we have a client that either is not responding or is in the middle of something large.

NB: I won't be around for the first half of next week to see any responses or answer questions... Perhaps it's vacation mode that isn't allowing me to think about the pipe signaling logic right now ;-)

John