[...]
>>
>> Now you've lost me. What are the back traces? and now does one
>> reasonably reproduce? Are you trying to advocate here for [2] to be
>> reviewed/accepted?
>
> Sorry for that. Unfortunately back trace was only emailed in the first version
> of series [3]. The reproducer is there too.
>
> I'm not here for [2]. It addresses a different issue. However I want to
> make some progress on that series too.
>
> [3]
https://www.redhat.com/archives/libvir-list/2017-September/msg01006.html
>
Ahh, sigh... I should have been more "proactive" in ensuring that I had
gone back through the above test when two of the patches I had posted
related to the changes in this code were not ACK'd. Those two patches are:
https://www.redhat.com/archives/libvir-list/2017-November/msg00296.html
and
https://www.redhat.com/archives/libvir-list/2017-November/msg00297.html
What those did was perform the Unref from libvirtd as soon as the
AddServer or AddProgram added to the respective hash table (and of
course incr the refcnt). The result is the Unref for each was done
*prior* to the virNetDaemonClose and prior to the virStateCleanup which
solves the late run after Close when Dispose runs.
What jiggled my memory on this was today there was an OFTC IRC posting
regarding a virt-manager crash (see pastebin from cbosdonnat at
http://paste.opensuse.org/44522138) as a result of incorrect ordering of
daemon unref and running virStateCleanup. A copy/paste of the
subsequent conversation is:
<cbosdonnat> I'm seeing a crash here when libvirtd is stopping and a
client (namely virt-manager) calls
networkConnectNetworkEventDeregisterAny().... is it OK to just goto
cleanup if the driver is NULL in that function?
<danpb> cbosdonnat: you mean the driver is being torn down in
virStateCleanup ? and an API is still allowed to be run ?
<cbosdonnat> danpb, looks like that indeed
<cbosdonnat> where would the API call be rejected otherwise?
<danpb> that seems a much more general problem - we should not call
virStateCleanup until we've killed off all the API worker threads
<danpb> i kind of thought we already did that
<danpb> is it perhaps that we're killing off active clients and as part
of that we're cleaning up events they have registered ?
<cbosdonnat> danpb, here is the backtrace of the crash:
http://paste.opensuse.org/44522138
<cbosdonnat> so seems like that indeed
<danpb> oh yeah, we're killing off clients
<danpb> we need to explicitly be able to kill off clients before
shutting down drivers
<danpb> surprised we've not seen this before to be honest
<danpb> i wonder if some refactoring has introduced this problem recently
<cbosdonnat> are those done in separate threads?
<danpb> i can't rememeber to be honest
<cbosdonnat> I'll chase the order of these then
<cbosdonnat> danpb, the state cleanup happens before the daemon unref...
looks like we'll need to purge the daemon before that
<cbosdonnat> danpb, would that be completely silly to move the client
closing code to virNetServerClose()?
<danpb> that seems pretty sensible to me actually
<danpb> logically they feel related actions
<cbosdonnat> we can keep the unref of them where it is though if it
makes more sense
<cbosdonnat> yep
So yes, the "refactoring" that caused this is commit id '2f3054c22'.
As a test, I altered the code to perform the all the Unref's before the
virStateCleanup and that "resolved" the virt-manager crash; however,
that would seem to conflict with commit id '85c3a1820' which noted that
performing the last Unref(dmn) prior to virStateCleanup would cause
issues for daemonInhibitCallback.
So I moved the Unref(dmn) after virStateCleanup... That of course led to
the virt-manager issue. So I figured I'd give adding this patch in and
that then fixed the virt-manager crash (obviously as we'd be Unref'ing
all those servers we added).
So after all that if I add that sleep into domstats as shown in the [3]
link from above - I don't get a crash, but it also seems to cause the
SIGTERM to be ignored at least the first time through. Running a second
SIGTERM does the proper kill.
So short story made really long, I think the best course of action will
be to add this patch and reorder the Unref()'s (adminProgram thru srv,
but not dmn). It seems to resolve these corner cases, but I'm also open
to other suggestions. Still need to think about it some more too before
posting any patches.
Hi.
I'm not grasp the whole picture yet but I've managed to find out what
triggered the crash. It is not 2f3054c22 where you reordered unrefs but
1fd1b766105 which moves events unregistering from netserver client closing to
netservec client disposing. Before 1fd1b766105 we don't have crash
because clients simply do not get disposed.
As to fixing the crash with this patch I thinks its is coincidence. I want
do dispose netservers early to join rpc threads and it turns out that
disposing also closing clients too and this fixes the problem.
Nikolay