
On 4/2/19 10:45 AM, Christian Ehrhardt wrote:
On Mon, Apr 1, 2019 at 4:35 PM Michal Privoznik <mprivozn@redhat.com> wrote:
On 4/1/19 4:25 PM, Christian Ehrhardt wrote:
Hi, I happened to analyze a bug report [1] I got from a friend, and for quite a while it was rather elusive. But I have now finally got it reproducible [2] enough to share it with the community.
The TL;DR of what I see is (sketched below):
- an automation using python-libvirt gets a SIGINT
- its cleanup runs destroy and then undefine
- the guest closes FDs due to the SIGINT and/or the destroy, which triggers daemonStreamHandleAbort
- those two paths fight over the same lock
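For a rough picture, here is that cleanup sequence rendered in C (the actual automation/reproducer [2] is a Python script; the connection URI, the guest name "testguest", and the omitted error handling and console-stream setup are placeholders of mine):

#include <libvirt/libvirt.h>

/* Schematic only: the real reproducer drives this from a SIGINT
 * handler while a console stream to the guest is still open. */
static void
cleanup(virConnectPtr conn)
{
    virDomainPtr dom = virDomainLookupByName(conn, "testguest");

    /* Destroying the guest makes it close its console FDs, which
     * fires daemonStreamHandleAbort in libvirtd ... */
    virDomainDestroy(dom);

    /* ... while undefine tears down the chardevs via virChrdevFree,
     * racing the abort path for the same locks. */
    virDomainUndefine(dom);

    virDomainFree(dom);
}

int
main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    cleanup(conn);
    virConnectClose(conn);
    return 0;
}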
There I get libvirtd into a deadlock that leaves all threads stuck [4], with two of them in particular fighting over locks (see [3] for details).
The two related stacks, summarized, look like this:
daemonStreamHandleWrite (failing to write)
 -> daemonStreamHandleAbort (closing things and cleaning up)
 -> ...
 -> virChrdevFDStreamCloseCb
    -> virMutexLock(&priv->devs->lock);
# There is code meant to avoid such issues that emits "Unable to close"
# if the lock is held, but the log doesn't show this triggering even
# with debug enabled.
#10 seems triggered via an "undefine" call:
remoteDispatchDomainUndefine
 -> ...
 -> virChrdevFree
 -> ...
 -> virFDStreamSetInternalCloseCb
    -> virObjectLock(virFDStreamDataPtr fdst)
    -> virMutexLock(&obj->lock);
# closing all streams of a guest (requiring the same locks)
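Stripped of the libvirt specifics, the two stacks form a textbook ABBA lock inversion. A self-contained sketch with plain pthreads (the naming is mine; this is not libvirt's actual locking code) that can hang whenever the two threads interleave badly:

#include <pthread.h>

/* Stand-ins for priv->devs->lock and the stream's obj->lock. */
static pthread_mutex_t devs_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t fdst_lock = PTHREAD_MUTEX_INITIALIZER;

/* Path 1: daemonStreamHandleAbort holds the stream lock, then
 * virChrdevFDStreamCloseCb wants the chardev hash lock. */
static void *
abort_path(void *arg)
{
    pthread_mutex_lock(&fdst_lock);
    pthread_mutex_lock(&devs_lock);   /* blocks if path 2 holds it */
    pthread_mutex_unlock(&devs_lock);
    pthread_mutex_unlock(&fdst_lock);
    return arg;
}

/* Path 2: virChrdevFree holds the chardev hash lock, then wants
 * each stream's lock in order to close it. */
static void *
undefine_path(void *arg)
{
    pthread_mutex_lock(&devs_lock);
    pthread_mutex_lock(&fdst_lock);   /* blocks if path 1 holds it */
    pthread_mutex_unlock(&fdst_lock);
    pthread_mutex_unlock(&devs_lock);
    return arg;
}

int
main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, abort_path, NULL);
    pthread_create(&t2, NULL, undefine_path, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}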
While that already feels quite close, I struggle to see where exactly we'd want to fix it. But now that I finally have a repro script [2], I hope someone else here might be able to help me with that.
After all, it is a race: on my s390x system it usually triggers in fewer than 5 tries, while on x86 I have needed up to 18 runs of the test before it hangs. Depending on your system configuration it might trigger more or less readily.
FYI, we initially hit this with libvirt 4.0, but libvirt 5.0 behaves just the same. I haven't built 5.1 or a recent master, but none of the commits since 5.0 mention an issue that seems related. OTOH I'm willing and able to build and try suggestions if anyone comes up with ideas.
[1]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096
[2]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/+attachment/5...
[3]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/3
[4]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/17
You may want to look at commit d63c82df8b11b583dec8e72dfb216d8c14783876 (contained in 5.1.0), because this smells like the issue you're facing.
Thanks Michal, I agree that this appears to be similar. But unfortunately, 5.0 plus the full nine-patch series leading up to d63c82df still triggers the deadlock that we found. So it seems to be a new issue :-/
As I said before, any further suggestions (commits to test and/or ideas for how to resolve this) are welcome. Thanks in advance!
Okay, how about this then:

diff --git a/src/util/virfdstream.c b/src/util/virfdstream.c
index 1bc43e20a1..0c0bb3ae78 100644
--- a/src/util/virfdstream.c
+++ b/src/util/virfdstream.c
@@ -680,6 +680,9 @@ static int
 virFDStreamCloseInt(virStreamPtr st, bool streamAbort)
 {
     virFDStreamDataPtr fdst;
+    virFDStreamInternalCloseCb icbCb;
+    void *icbOpaque;
+    virFDStreamInternalCloseCbFreeOpaque icbFreeOpaque;
     virStreamEventCallback cb;
     void *opaque;
     int ret;
@@ -730,10 +733,17 @@ virFDStreamCloseInt(virStreamPtr st, bool streamAbort)
 
     /* call the internal stream closing callback */
     if (fdst->icbCb) {
-        /* the mutex is not accessible anymore, as private data is null */
-        (fdst->icbCb)(st, fdst->icbOpaque);
-        if (fdst->icbFreeOpaque)
-            (fdst->icbFreeOpaque)(fdst->icbOpaque);
+        icbCb = fdst->icbCb;
+        icbOpaque = fdst->icbOpaque;
+        icbFreeOpaque = fdst->icbFreeOpaque;
+
+        virObjectUnlock(fdst);
+
+        (icbCb)(st, icbOpaque);
+        if (icbFreeOpaque)
+            (icbFreeOpaque)(icbOpaque);
+
+        virObjectLock(fdst);
     }
 
     if (fdst->dispatching) {

Michal
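Read against the pthread sketch earlier in the thread, this patch breaks the inversion by moving the callback invocation outside the stream lock: the callback pointers are copied to locals while fdst is locked, the lock is dropped for the call, and reacquired afterwards since the rest of virFDStreamCloseInt still expects it held. In terms of the two stand-in mutexes from that sketch (again a schematic of the idea, not libvirt code), path 1 would become:

static void *
abort_path_fixed(void *arg)
{
    pthread_mutex_lock(&fdst_lock);
    /* copy out the callback and its opaque data ... */
    pthread_mutex_unlock(&fdst_lock);

    /* ... so the chardev lock is never taken with the stream
     * lock held; no circular wait remains. */
    pthread_mutex_lock(&devs_lock);
    pthread_mutex_unlock(&devs_lock);

    /* relock, as the caller expects fdst locked on return */
    pthread_mutex_lock(&fdst_lock);
    pthread_mutex_unlock(&fdst_lock);
    return arg;
}

With that ordering, neither thread ever waits for one of the two locks while holding the other.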